Evaluating PTMs for User Feedback Analysis in SE

Pre-print | Slides

Context

  • Analyzing app reviews has proven useful for many areas of software engineering.
  • Automatic classification of app reviews requires extensive manual effort to curate a labeled dataset.
  • Recent pre-trained neural language models (PTMs) are trained on large corpora in an unsupervised manner and have been successful on similar Natural Language Processing problems.

Objective

  • We investigate the benefits of PTMs for app review classification compared to existing models.
  • We also examine the transferability of PTMs in multiple settings.

Proposed Method

  • We empirically study the accuracy and time efficiency of PTMs compared to prior approaches, using six datasets from the literature. We also investigate the performance of PTMs further pre-trained on app reviews (a fine-tuning sketch follows this list).
  • We set up studies to evaluate PTMs in multiple settings: binary vs. multi-class classification, zero-shot classification, a multi-task setting, and classification of reviews collected from different platforms.
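As a concrete illustration of the classification setup, the sketch below fine-tunes a PTM on labeled app reviews using the HuggingFace transformers library. The model name, label set, and example reviews are placeholders, not the study's exact configuration.

```python
# Minimal sketch: fine-tuning a PTM (here bert-base-uncased) for app review
# classification. Labels and reviews are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["Bug report", "Feature request", "User experience", "Rating"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

reviews = [
    "The app crashes every time I open the camera.",
    "Please add a dark mode option.",
]
targets = torch.tensor([0, 1])  # indices into LABELS

# Tokenize and run a single training step; a real setup would iterate
# over mini-batches and epochs and track validation F1.
batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=targets).loss
loss.backward()
optimizer.step()
```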

Contributions

  • This is the first study to compare four PTMs against four existing tools/approaches on six app review datasets of different sizes and label sets.
  • We are the first to explore the performance of general-purpose versus domain-specific PTMs for app review classification (a continued pre-training sketch follows this list).
  • This is the first empirical study to examine the accuracy and efficiency of PTMs in four settings: binary vs. multi-class classification, zero-shot classification, a multi-task setting, and a cross-platform setting in which the model is trained on data from one platform (e.g., the App Store) and tested on data from another (e.g., Twitter).
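To make the general vs. domain-specific comparison concrete, the sketch below continues a PTM's masked-language-model pre-training on raw app reviews before any fine-tuning. The input file, base model, and hyperparameters are assumptions for illustration, not the study's setup.

```python
# Sketch of domain-adaptive pre-training: continue BERT's masked-language-model
# objective on app review text. File path and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One review per line in a plain-text file (hypothetical path).
raw = load_dataset("text", data_files={"train": "app_reviews.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-app-reviews", num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens so the model adapts to review-domain language.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint can then be fine-tuned for classification
```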

Research Questions

  • RQ1: How accurate and efficient are PTMs in classifying app reviews compared to existing tools?

  • RQ2: How does the performance of PTMs change when they are pre-trained on an app review corpus instead of generic corpora (e.g., Wikipedia documents, book corpus)?

  • RQ3: How do PTMs perform in the following settings? (a) binary vs. multi-class classification, (b) zero-shot classification (sketched below), (c) a multi-task setting (i.e., different app review analysis tasks), and (d) classification of user reviews collected from different platforms.
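For RQ3(b), a zero-shot classifier assigns labels the model was never fine-tuned on. A minimal sketch using an off-the-shelf NLI-based model follows; the model choice and candidate labels (borrowed from D2's classes) are illustrative, not necessarily the study's configuration.

```python
# Sketch of zero-shot app review classification via an NLI-based model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

review = "The login screen freezes after the latest update."
result = classifier(review, candidate_labels=["Problem Report", "Inquiry", "Irrelevant"])
print(result["labels"][0], round(result["scores"][0], 3))  # top class and its score
```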

Datasets

  • Dataset 1 (D1): collected by Gu and Kim (contains 34,000 reviews of 17 popular Android apps, labeled with five classes)
  • Dataset 2 (D2): collected by Stanik et al. (contains 6,406 app reviews from Google Play and 10,364 tweets, manually labeled into three classes: Problem Report, Inquiry, and Irrelevant); used in the cross-platform sketch after this list
  • Dataset 3 (D3): provided by Lu and Liang (contains 2,000 review sentences from two apps, one from Google Play and one from the Apple App Store; classified into six categories)
  • Dataset 4 (D4): collected by Maalej and Nabil (contains 2,000 manually labeled reviews of apps randomly selected from the top apps in different categories; classified into four categories)
  • Dataset 5 (D5): published by Guo et al. (contains 1,500 app reviews from 151 randomly selected apps on the Apple App Store; classified into three categories)
  • Dataset 6 (D6): collected by Guzman et al. (contains 1,820 reviews of three apps from the Apple App Store and four apps from Google Play, labeled with seven categories)
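To illustrate the cross-platform setting on data like D2, the sketch below trains on Google Play reviews and evaluates on tweets. The CSV file and column names are hypothetical, and a simple TF-IDF plus logistic regression baseline stands in for a PTM purely to show the evaluation protocol.

```python
# Sketch of the train-on-one-platform / test-on-another protocol (cf. D2).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical file with columns: text, label, source.
df = pd.read_csv("d2_labeled_feedback.csv")
train = df[df["source"] == "google_play"]
test = df[df["source"] == "twitter"]

vec = TfidfVectorizer(min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train["text"]), train["label"])

preds = clf.predict(vec.transform(test["text"]))
print("Macro F1 on Twitter:", f1_score(test["label"], preds, average="macro"))
```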
