Evaluating PTMs for User Feedback Analysis in SE

Pre-print | Slides

Context

  • Analyzing app reviews has proven useful for many areas of software engineering.
  • Automatic classification of app reviews requires extensive manual effort to curate a labeled dataset.
  • Recent pre-trained neural language models (PTMs) are trained on large corpora in an unsupervised manner and have been successful on similar Natural Language Processing problems.

Objective

  • We investigate the benefits of PTMs for app review classification compared to existing models.
  • We also examine the transferability of PTMs in multiple settings.

Proposed Method

  • We empirically study the accuracy and time efficiency of PTMs compared to prior approaches, using six datasets from the literature. We also investigate the performance of PTMs further pre-trained on app reviews (a fine-tuning sketch follows this list).
  • We set up studies to evaluate PTMs in multiple settings: binary vs. multi-class classification, zero-shot classification, a multi-task setting, and classification of reviews collected from different platforms.
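As a concrete illustration of the classification setup, the sketch below fine-tunes a PTM on labeled app reviews using the HuggingFace transformers library. The model name, label set, and example reviews are placeholders, not the study's exact configuration.

```python
# Minimal sketch: fine-tuning a PTM (here bert-base-uncased) for app review
# classification. Labels and reviews are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Bug report", "Feature request", "User experience", "Rating"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

reviews = [
    "The app crashes every time I open the camera.",
    "Please add a dark mode option.",
]
targets = torch.tensor([0, 1])  # indices into LABELS

# Tokenize and run a single training step; a real setup would iterate
# over mini-batches and epochs and track validation F1.
batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=targets).loss
loss.backward()
optimizer.step()
```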

Contributions

  • This is the first study to compare four PTMs against four existing tools/approaches on six app review datasets of different sizes and label sets.
  • We are the first to explore the performance of general-purpose versus domain-specific PTMs for app review classification (a continued pre-training sketch follows this list).
  • This is the first empirical study to examine the accuracy and efficiency of PTMs in four settings: binary vs. multi-class classification, zero-shot classification, a multi-task setting, and a cross-platform setting in which the model is trained on data from one platform (e.g., the App Store) and tested on data from another (e.g., Twitter).
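To make the general vs. domain-specific comparison concrete, the sketch below continues a PTM's masked-language-model pre-training on raw app reviews before any fine-tuning. The input file, base model, and hyperparameters are assumptions for illustration, not the study's setup.

```python
# Sketch of domain-adaptive pre-training: continue BERT's masked-language-model
# objective on app review text. File path and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One review per line in a plain-text file (hypothetical path).
raw = load_dataset("text", data_files={"train": "app_reviews.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-app-reviews", num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens so the model adapts to review-domain language.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint can then be fine-tuned for classification
```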

Research Questions

  • RQ1: How accurate and efficient are PTMs in classifying app reviews compared to existing tools?

  • RQ2: How does the performance of PTMs change when they are pre-trained on an app review corpus instead of generic corpora (e.g., Wikipedia documents, book corpus)?

  • RQ3: How do PTMs perform in the following settings? (a) binary vs. multi-class classification, (b) zero-shot classification (sketched below), (c) a multi-task setting (i.e., different app review analysis tasks), and (d) classification of user reviews collected from different platforms.
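For RQ3(b), a zero-shot classifier assigns labels the model was never fine-tuned on. A minimal sketch using an off-the-shelf NLI-based model follows; the model choice and candidate labels (borrowed from D2's classes) are illustrative, not necessarily the study's configuration.

```python
# Sketch of zero-shot app review classification via an NLI-based model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

review = "The login screen freezes after the latest update."
result = classifier(review, candidate_labels=["Problem Report", "Inquiry", "Irrelevant"])
print(result["labels"][0], round(result["scores"][0], 3))  # top class and its score
```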

Datasets

  • Dataset 1 (D1): collected by Gu and Kim (contains 34,000 reviews of 17 popular Android apps, labeled with five classes)
  • Dataset 2 (D2): collected by Stanik et al. (contains 6,406 app reviews from Google Play and 10,364 tweets, manually labeled into three classes: Problem Report, Inquiry, and Irrelevant); used in the cross-platform sketch after this list
  • Dataset 3 (D3): provided by Lu and Liang (contains 2,000 review sentences from two apps, one from Google Play and one from the Apple App Store; classified into six categories)
  • Dataset 4 (D4): collected by Maalej and Nabil (contains 2,000 manually labeled reviews of apps randomly selected from the top apps in different categories; classified into four categories)
  • Dataset 5 (D5): published by Guo et al. (contains 1,500 app reviews from 151 randomly selected apps on the Apple App Store; classified into three categories)
  • Dataset 6 (D6): collected by Guzman et al. (contains 1,820 reviews of three apps from the Apple App Store and four apps from Google Play, labeled with seven categories)
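To illustrate the cross-platform setting on data like D2, the sketch below trains on Google Play reviews and evaluates on tweets. The CSV file and column names are hypothetical, and a simple TF-IDF plus logistic regression baseline stands in for a PTM purely to show the evaluation protocol.

```python
# Sketch of the train-on-one-platform / test-on-another protocol (cf. D2).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical file with columns: text, label, source.
df = pd.read_csv("d2_labeled_feedback.csv")
train = df[df["source"] == "google_play"]
test = df[df["source"] == "twitter"]

vec = TfidfVectorizer(min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train["text"]), train["label"])

preds = clf.predict(vec.transform(test["text"]))
print("Macro F1 on Twitter:", f1_score(test["label"], preds, average="macro"))
```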
