Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media: Evaluation Study

Journal of Medical Internet Research | IF 6.0 | Q1, HEALTH CARE SCIENCES & SERVICES (CAS Region 2, Medicine) | Pub Date: 2025-03-04 | DOI: 10.2196/63631
Kwanho Kim, Soojong Kim
Citations: 0

Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media: Evaluation Study.

Background: Sentiment analysis of alternative tobacco products discussed on social media is crucial in tobacco control research. Large language models (LLMs) are artificial intelligence models that were trained on extensive text data to emulate the linguistic patterns of humans. LLMs may hold the potential to streamline the time-consuming and labor-intensive process of human sentiment analysis.

Objective: This study aimed to examine the accuracy of LLMs in replicating human sentiment evaluation of social media messages relevant to heated tobacco products (HTPs).

Methods: GPT-3.5 and GPT-4 Turbo (OpenAI) were used to classify 500 Facebook (Meta Platforms) and 500 Twitter (subsequently rebranded X) messages. Each set consisted of 200 human-labeled anti-HTP messages, 200 pro-HTP messages, and 100 neutral messages. The models evaluated each message up to 20 times, generating multiple response instances that each reported a classification decision. The majority label across these responses was assigned as the model's decision for the message. The models' classification decisions were then compared with those of human evaluators.
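The repeated-labeling and majority-vote procedure described above can be sketched as follows. This is a minimal illustration, not the study's code: `get_label` stands in for a call to GPT-3.5 or GPT-4 Turbo via the OpenAI API, and the label names are placeholders.

```python
from collections import Counter

def majority_label(get_label, message, n_instances=20):
    """Query a classifier n_instances times and return the majority label.

    get_label: callable taking a message and returning a sentiment label
    (e.g. "anti", "pro", "neutral", "irrelevant"); in the study this
    would be one call to GPT-3.5 or GPT-4 Turbo.
    """
    votes = Counter(get_label(message) for _ in range(n_instances))
    label, _ = votes.most_common(1)[0]
    return label

def accuracy(model_decisions, human_labels):
    """Share of messages where the model's decision matches the human label."""
    matches = sum(d == h for d, h in zip(model_decisions, human_labels))
    return matches / len(human_labels)
```

With a real model behind `get_label`, each call may return a different label; taking the majority across instances stabilizes the decision, which is why only 3 instances already reached 99% of the 20-instance accuracy for GPT-4 Turbo.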

Results: GPT-3.5 accurately replicated human sentiment evaluation in 61.2% of Facebook messages and 57% of Twitter messages. GPT-4 Turbo demonstrated higher accuracies overall, with 81.7% for Facebook messages and 77% for Twitter messages. GPT-4 Turbo's accuracy with 3 response instances reached 99% of the accuracy achieved with 20 response instances. GPT-4 Turbo's accuracy was higher for human-labeled anti- and pro-HTP messages than for neutral messages. Most GPT-3.5 misclassifications occurred when the model incorrectly classified anti- or pro-HTP messages as neutral or irrelevant, whereas GPT-4 Turbo improved across all sentiment categories and reduced misclassifications, especially messages incorrectly categorized as irrelevant.

Conclusions: LLMs can be used to analyze sentiment in social media messages about HTPs. Results from GPT-4 Turbo suggest that accuracy can reach approximately 80% compared with the results of human experts, even with a small number of labeling decisions generated by the model. A potential risk of using LLMs is the misrepresentation of the overall sentiment due to the differences in accuracy across sentiment categories. Although this issue could be reduced with the newer language model, future efforts should explore the mechanisms underlying the discrepancies and how to address them systematically.
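The risk noted in the conclusions, that unequal accuracy across sentiment categories can distort the estimated overall sentiment distribution, can be illustrated with hypothetical numbers. The recall values and spill pattern below are assumptions for illustration only, not results from the study:

```python
from collections import Counter

def predicted_distribution(true_counts, recall, spill):
    """If a category's recall < 1, its misclassified messages land in
    another category (per `spill`), skewing the estimated distribution."""
    predicted = Counter()
    for category, n in true_counts.items():
        correct = round(n * recall[category])
        predicted[category] += correct
        predicted[spill[category]] += n - correct
    return dict(predicted)

# Hypothetical per-category recalls echoing the pattern in the study:
# polarized messages recovered well, neutral ones mislabeled more often.
true_counts = {"anti": 200, "pro": 200, "neutral": 100}
recall = {"anti": 0.9, "pro": 0.9, "neutral": 0.6}
spill = {"anti": "neutral", "pro": "neutral", "neutral": "irrelevant"}
est = predicted_distribution(true_counts, recall, spill)
```

Under these assumed numbers, the estimated distribution understates the polarized categories and invents an "irrelevant" share, even though overall accuracy looks high: exactly the misrepresentation risk the authors flag.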

Source journal: Journal of Medical Internet Research. CiteScore: 14.40. Self-citation rate: 5.40%. Articles per year: 654. Time to review: 1 month.
Journal description: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. With a founding date in 1999, JMIR has been a pioneer in the field for over two decades. As a leader in the industry, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor. Notably, JMIR holds the prestigious position of being ranked #1 on Google Scholar within the "Medical Informatics" discipline.