A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study

Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma
{"title":"A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study","authors":"Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma","doi":"arxiv-2409.07162","DOIUrl":null,"url":null,"abstract":"Analyzing user reviews for sentiment towards app features can provide\nvaluable insights into users' perceptions of app functionality and their\nevolving needs. Given the volume of user reviews received daily, an automated\nmechanism to generate feature-level sentiment summaries of user reviews is\nneeded. Recent advances in Large Language Models (LLMs) such as ChatGPT have\nshown impressive performance on several new tasks without updating the model's\nparameters i.e. using zero or a few labeled examples. Despite these\nadvancements, LLMs' capabilities to perform feature-specific sentiment analysis\nof user reviews remain unexplored. This study compares the performance of\nstate-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for\nextracting app features and associated sentiments under 0-shot, 1-shot, and\n5-shot scenarios. Results indicate the best-performing GPT-4 model outperforms\nrule-based approaches by 23.6% in f1-score with zero-shot feature extraction;\n5-shot further improving it by 6%. GPT-4 achieves a 74% f1-score for predicting\npositive sentiment towards correctly predicted app features, with 5-shot\nenhancing it by 7%. 
Our study suggests that LLM models are promising for\ngenerating feature-specific sentiment summaries of user reviews.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07162","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Analyzing user reviews for sentiment towards app features can provide valuable insights into users' perceptions of app functionality and their evolving needs. Given the volume of user reviews received daily, an automated mechanism for generating feature-level sentiment summaries of user reviews is needed. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using only zero or a few labeled examples. Despite these advancements, LLMs' ability to perform feature-specific sentiment analysis of user reviews remains unexplored. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, at extracting app features and their associated sentiments under 0-shot, 1-shot, and 5-shot scenarios. Results indicate that the best-performing model, GPT-4, outperforms rule-based approaches by 23.6% in F1-score for zero-shot feature extraction, with 5-shot prompting further improving it by 6%. GPT-4 achieves a 74% F1-score for predicting positive sentiment towards correctly predicted app features, with 5-shot prompting enhancing it by 7%. Our study suggests that LLMs are promising for generating feature-specific sentiment summaries of user reviews.
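To make the task concrete, the pipeline the abstract describes can be sketched as: build an instruction prompt around a review, send it to an LLM, and parse (feature, sentiment) pairs from the reply. The prompt wording, the tab-separated output format, and the function names below are illustrative assumptions, not the paper's actual prompts; the model call is mocked so the sketch is self-contained.

```python
# Minimal zero-shot sketch of feature-level sentiment extraction.
# Assumption: prompt wording and the 'feature<TAB>sentiment' output
# format are hypothetical, not taken from the paper.

def build_zero_shot_prompt(review: str) -> str:
    """Build a zero-shot instruction asking for (feature, sentiment) pairs."""
    return (
        "Extract the app features mentioned in the review below and the "
        "sentiment (positive, negative, or neutral) expressed towards each.\n"
        "Answer with one 'feature<TAB>sentiment' pair per line.\n\n"
        f"Review: {review}"
    )

def parse_pairs(llm_output: str) -> list:
    """Parse tab-separated 'feature<TAB>sentiment' lines from a model reply."""
    pairs = []
    for line in llm_output.strip().splitlines():
        if "\t" in line:
            feature, sentiment = line.split("\t", 1)
            pairs.append((feature.strip(), sentiment.strip().lower()))
    return pairs

# Example with a mocked model response (no API call is made here):
review = "Love the dark mode, but push notifications keep failing."
prompt = build_zero_shot_prompt(review)
mock_response = "dark mode\tpositive\npush notifications\tnegative"
print(parse_pairs(mock_response))
```

A 1-shot or 5-shot variant would simply prepend one or five labeled review/answer examples to the same instruction before the target review.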