A comparative analysis of large language models on clinical questions for autoimmune diseases.

Frontiers in digital health (IF 3.2, Q1, Health Care Sciences & Services) | Pub Date: 2025-03-03 | eCollection Date: 2025-01-01 | DOI: 10.3389/fdgth.2025.1530442
Jing Chen, Juntao Ma, Jie Yu, Weiming Zhang, Yijia Zhu, Jiawei Feng, Linyu Geng, Xianchi Dong, Huayong Zhang, Yuxin Chen, Mingzhe Ning
Frontiers in digital health, volume 7, article 1530442. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11913117/pdf/
Citations: 0

Abstract

Background: Artificial intelligence (AI) has made great strides. To explore the potential of large language models (LLMs) in providing medical services to patients and assisting physicians in clinical practice, our study evaluated their performance in answering clinical questions related to autoimmune diseases.

Methods: Forty-six questions related to autoimmune diseases were submitted to ChatGPT 3.5, ChatGPT 4.0, and Gemini. Rheumatologists then evaluated the responses on five quality dimensions: relevance, correctness, completeness, helpfulness, and safety. In parallel, laboratory specialists assessed the responses across six medical fields: concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis. Finally, the performance of the three chatbots was compared statistically across the five quality dimensions and the six medical fields.

Results: ChatGPT 4.0 outperformed both ChatGPT 3.5 and Gemini across all five quality dimensions, with an average score of 199.8 ± 10.4, significantly higher than ChatGPT 3.5 (175.7 ± 16.6) and Gemini (179.1 ± 11.8) (p = 0.009 and p = 0.001, respectively). The average performance differences between ChatGPT 3.5 and Gemini across these five dimensions were not statistically significant. Specifically, ChatGPT 4.0 demonstrated superior performance in relevance (p < 0.0001, p < 0.0001), completeness (p < 0.0001, p = 0.0006), correctness (p = 0.0001, p = 0.0002), helpfulness (p < 0.0001, p < 0.0001), and safety (p < 0.0001, p = 0.0025) compared to both ChatGPT 3.5 and Gemini. Furthermore, ChatGPT 4.0 scored significantly higher than both ChatGPT 3.5 and Gemini in the medical fields of report interpretation (p < 0.0001, p = 0.0025), prevention and treatment (p < 0.0001, p = 0.0103), and prognosis (p = 0.0458, p = 0.0458).
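
As a reading aid, the reported figures can be tabulated and checked with a short script. This is only a sketch built from the numbers quoted in the Results, not the authors' analysis: the abstract does not name the statistical test used, so none is rerun here. The names `MEAN_SCORES`, `REPORTED_P`, `mean_gap`, `all_significant`, and `weakest_comparison` are hypothetical. Two assumptions are made: each pair of p-values is ordered (vs ChatGPT 3.5, vs Gemini), and values reported as "p < 0.0001" are stored as the upper bound 0.0001.

```python
# Mean ± SD of the average score, as reported in the Results.
MEAN_SCORES = {
    "ChatGPT 4.0": (199.8, 10.4),
    "ChatGPT 3.5": (175.7, 16.6),
    "Gemini": (179.1, 11.8),
}

# Reported p-values for ChatGPT 4.0 against the other two chatbots.
# ASSUMPTION: each pair is ordered (vs ChatGPT 3.5, vs Gemini).
# ASSUMPTION: "p < 0.0001" is stored as its upper bound, 0.0001.
REPORTED_P = {
    "relevance": (0.0001, 0.0001),
    "completeness": (0.0001, 0.0006),
    "correctness": (0.0001, 0.0002),
    "helpfulness": (0.0001, 0.0001),
    "safety": (0.0001, 0.0025),
    "report interpretation": (0.0001, 0.0025),
    "prevention and treatment": (0.0001, 0.0103),
    "prognosis": (0.0458, 0.0458),
}


def mean_gap(model_a: str, model_b: str) -> float:
    """Difference between the reported mean scores of two chatbots."""
    return round(MEAN_SCORES[model_a][0] - MEAN_SCORES[model_b][0], 1)


def all_significant(alpha: float = 0.05) -> dict[str, bool]:
    """For each comparison, whether both reported p-values fall below alpha."""
    return {name: max(pair) < alpha for name, pair in REPORTED_P.items()}


def weakest_comparison() -> str:
    """Comparison whose larger reported p-value is closest to the threshold."""
    return max(REPORTED_P, key=lambda name: max(REPORTED_P[name]))


if __name__ == "__main__":
    print("Gap vs ChatGPT 3.5:", mean_gap("ChatGPT 4.0", "ChatGPT 3.5"))
    print("Gap vs Gemini:", mean_gap("ChatGPT 4.0", "Gemini"))
    print("All below 0.05:", all(all_significant().values()))
    print("Weakest comparison:", weakest_comparison())
```

Under these assumptions, every listed comparison falls below the conventional 0.05 threshold, with prognosis (p = 0.0458) the closest to it. Passing a stricter `alpha` to `all_significant` shows which comparisons would remain under a more conservative cutoff.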

Conclusions: This study demonstrates that ChatGPT 4.0 significantly outperforms ChatGPT 3.5 and Gemini in addressing clinical questions related to autoimmune diseases, showing notable advantages across all five quality dimensions and six clinical domains. These findings further highlight the potential of large language models in enhancing healthcare services.
