A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study

Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma
{"title":"A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study","authors":"Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma","doi":"arxiv-2409.07162","DOIUrl":null,"url":null,"abstract":"Analyzing user reviews for sentiment towards app features can provide\nvaluable insights into users' perceptions of app functionality and their\nevolving needs. Given the volume of user reviews received daily, an automated\nmechanism to generate feature-level sentiment summaries of user reviews is\nneeded. Recent advances in Large Language Models (LLMs) such as ChatGPT have\nshown impressive performance on several new tasks without updating the model's\nparameters i.e. using zero or a few labeled examples. Despite these\nadvancements, LLMs' capabilities to perform feature-specific sentiment analysis\nof user reviews remain unexplored. This study compares the performance of\nstate-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for\nextracting app features and associated sentiments under 0-shot, 1-shot, and\n5-shot scenarios. Results indicate the best-performing GPT-4 model outperforms\nrule-based approaches by 23.6% in f1-score with zero-shot feature extraction;\n5-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting\npositive sentiment towards correctly predicted app features, with 5-shot\nenhancing it by 7%. 
Our study suggests that LLM models are promising for\ngenerating feature-specific sentiment summaries of user reviews.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07162","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism for generating feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using only zero or a few labeled examples. Despite these advancements, LLMs' ability to perform feature-specific sentiment analysis of user reviews remains unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, at extracting app features and their associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate that the best-performing model, GPT-4, outperforms rule-based approaches by 23.6% in F1-score for zero-shot feature extraction, with 5-shot prompting further improving it by 6%. GPT-4 achieves a 74% F1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot prompting enhancing it by 7%. Our study suggests that LLMs are promising for generating feature-specific sentiment summaries of user reviews.
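To make the task concrete, the pipeline the abstract describes can be sketched as: build an instruction prompt around a review, send it to an LLM, and parse (feature, sentiment) pairs from the reply. The prompt wording, the tab-separated output format, and the function names below are illustrative assumptions, not the paper's actual prompts; the model call is mocked so the sketch is self-contained.

```python
# Minimal zero-shot sketch of feature-level sentiment extraction.
# Assumption: prompt wording and the 'feature<TAB>sentiment' output
# format are hypothetical, not taken from the paper.

def build_zero_shot_prompt(review: str) -> str:
    """Build a zero-shot instruction asking for (feature, sentiment) pairs."""
    return (
        "Extract the app features mentioned in the review below and the "
        "sentiment (positive, negative, or neutral) expressed towards each.\n"
        "Answer with one 'feature<TAB>sentiment' pair per line.\n\n"
        f"Review: {review}"
    )

def parse_pairs(llm_output: str) -> list:
    """Parse tab-separated 'feature<TAB>sentiment' lines from a model reply."""
    pairs = []
    for line in llm_output.strip().splitlines():
        if "\t" in line:
            feature, sentiment = line.split("\t", 1)
            pairs.append((feature.strip(), sentiment.strip().lower()))
    return pairs

# Example with a mocked model response (no API call is made here):
review = "Love the dark mode, but push notifications keep failing."
prompt = build_zero_shot_prompt(review)
mock_response = "dark mode\tpositive\npush notifications\tnegative"
print(parse_pairs(mock_response))
```

A 1-shot or 5-shot variant would simply prepend one or five labeled review/answer examples to the same instruction before the target review.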