Large language models' performances regarding common patient questions about osteoarthritis: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Perplexity.

IF 10.3 | Region 1, Medicine | JCR Q1 (HOSPITALITY, LEISURE, SPORT & TOURISM) | Journal of Sport and Health Science | Pub Date: 2025-12-01 | Epub Date: 2024-11-28 | DOI: 10.1016/j.jshs.2024.101016

Mingde Cao, Qianwen Wang, Xueyou Zhang, Zuru Liang, Jihong Qiu, Patrick Shu-Hang Yung, Michael Tim-Yun Ong


Citation count: 0

Abstract

Background: Large Language Models (LLMs) have gained much attention and have partly replaced common search engines as a popular channel for obtaining information because their responses are contextually relevant. Osteoarthritis (OA) is a common topic among musculoskeletal disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries.

Methods: We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as "satisfactory" were evaluated for comprehensiveness on a 5-point scale.
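The majority consensus step described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `majority_rating`, the integer 1-4 encoding of the accuracy scale, and the choice to return `None` when no rating has a majority are all assumptions, since the abstract does not say how three-way disagreements were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_rating(ratings: Sequence[int]) -> Optional[int]:
    """Return the rating given by more than half of the raters, else None.

    Assumption: ratings are integers on a 1-4 accuracy scale. The abstract
    does not describe tie handling, so None marks "no majority" here.
    """
    if not ratings:
        return None
    value, count = Counter(ratings).most_common(1)[0]
    return value if count > len(ratings) / 2 else None


# Hypothetical scores from three raters for two responses.
print(majority_rating([4, 4, 3]))  # two of three raters agree
print(majority_rating([1, 2, 3]))  # all three raters disagree
```

With three raters, a majority exists whenever at least two of them give the same score.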

Results: ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention." However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity in that domain, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001).
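The reported percentages can be turned back into response counts with simple arithmetic. The sketch below assumes each LLM answered all 25 questions exactly once, which the abstract implies but does not state outright. It also searches for the fraction behind the 53.8% figure; the abstract does not give the number of "treatment and prevention" questions, so the result is an inference, not a reported value. All variable and function names are illustrative.

```python
TOTAL_QUESTIONS = 25  # number of frequently asked questions, from the Methods

# Percentages of "excellent" ratings as reported in the Results.
reported_excellent_pct = {"ChatGPT-4.0": 64, "ChatGPT-3.5": 40, "Perplexity": 28}


def implied_count(pct: float, total: int) -> int:
    """Number of responses implied by a percentage of `total`."""
    return round(pct * total / 100)


excellent_counts = {
    model: implied_count(pct, TOTAL_QUESTIONS)
    for model, pct in reported_excellent_pct.items()
}

# Find every k/n with n <= 25 that rounds to 53.8%, i.e. lies in
# [53.75%, 53.85%). Integer arithmetic avoids floating-point edge cases.
matches = [
    (k, n)
    for n in range(1, TOTAL_QUESTIONS + 1)
    for k in range(n + 1)
    if 5375 * n <= 10000 * k < 5385 * n
]

print(excellent_counts)  # {'ChatGPT-4.0': 16, 'ChatGPT-3.5': 10, 'Perplexity': 7}
print(matches)           # [(7, 13)]
```

Under these assumptions, 64%, 40%, and 28% correspond to 16, 10, and 7 of 25 responses, and 53.8% is consistent only with 7 of 13 responses in the "treatment and prevention" domain.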

Conclusion: Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.




Source journal

Journal of Sport and Health Science (SPORT SCIENCES)

CiteScore: 18.30
Self-citation rate: 1.70%
Articles published: 101
Review time: 22 weeks

About the journal: The Journal of Sport and Health Science (JSHS) is an international, multidisciplinary journal that aims to advance the fields of sport, exercise, physical activity, and health sciences. Published by Elsevier B.V. on behalf of Shanghai University of Sport, JSHS is dedicated to promoting original and impactful research, as well as topical reviews, editorials, opinions, and commentary papers. With a focus on physical and mental health, injury and disease prevention, traditional Chinese exercise, and human performance, JSHS offers a platform for scholars and researchers to share their findings and contribute to the advancement of these fields. Our journal is peer-reviewed, ensuring that all published works meet the highest academic standards. Supported by a carefully selected international editorial board, JSHS upholds impeccable integrity and provides an efficient publication platform. We invite submissions from scholars and researchers worldwide, and we are committed to disseminating insightful and influential research in the field of sport and health science.