{"title":"A comparative analysis of large language models on clinical questions for autoimmune diseases.","authors":"Jing Chen, Juntao Ma, Jie Yu, Weiming Zhang, Yijia Zhu, Jiawei Feng, Linyu Geng, Xianchi Dong, Huayong Zhang, Yuxin Chen, Mingzhe Ning","doi":"10.3389/fdgth.2025.1530442","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) has made great strides. To explore the potential of Large Language Models (LLMs) in providing medical services to patients and assisting physicians in clinical practice, our study evaluated the performance in delivering clinical questions related to autoimmune diseases.</p><p><strong>Methods: </strong>46 questions related to autoimmune diseases were input into ChatGPT 3.5, ChatGPT 4.0, and Gemini. The responses were then evaluated by rheumatologists based on five quality dimensions: relevance, correctness, completeness, helpfulness, and safety. Simultaneously, the responses were assessed by laboratory specialists across six medical fields: concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis. Finally, statistical analysis and comparisons were performed on the performance of the three chatbots in the five quality dimensions and six medical fields.</p><p><strong>Results: </strong>ChatGPT 4.0 outperformed both ChatGPT 3.5 and Gemini across all five quality dimensions, with an average score of 199.8 ± 10.4, significantly higher than ChatGPT 3.5 (175.7 ± 16.6) and Gemini (179.1 ± 11.8) (<i>p</i> = 0.009 and <i>p</i> = 0.001, respectively). The average performance differences between ChatGPT 3.5 and Gemini across these five dimensions were not statistically significant. Specifically, ChatGPT 4.0 demonstrated superior performance in relevance (<i>p</i> < 0.0001, <i>p</i> < 0.0001), completeness (<i>p</i> < 0.0001, <i>p</i> = 0.0006), correctness (<i>p</i> = 0.0001, <i>p</i> = 0.0002), helpfulness (<i>p</i> < 0.0001, <i>p</i> < 0.0001), and safety (<i>p</i> < 0.0001, <i>p</i> = 0.0025) compared to both ChatGPT 3.5 and Gemini. Furthermore, ChatGPT 4.0 scored significantly higher than both ChatGPT 3.5 and Gemini in medical fields such as report interpretation (<i>p</i> < 0.0001, <i>p</i> = 0.0025), prevention and treatment (<i>p</i> < 0.0001, <i>p</i> = 0.0103), prognosis (<i>p</i> = 0.0458, <i>p</i> = 0.0458).</p><p><strong>Conclusions: </strong>This study demonstrates that ChatGPT 4.0 significantly outperforms ChatGPT 3.5 and Gemini in addressing clinical questions related to autoimmune diseases, showing notable advantages across all five quality dimensions and six clinical domains. These findings further highlight the potential of large language models in enhancing healthcare services.</p>","PeriodicalId":73078,"journal":{"name":"Frontiers in digital health","volume":"7 ","pages":"1530442"},"PeriodicalIF":3.2000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11913117/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in digital health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdgth.2025.1530442","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Abstract
Background: Artificial intelligence (AI) has made great strides. To explore the potential of large language models (LLMs) in providing medical services to patients and assisting physicians in clinical practice, this study evaluated their performance in answering clinical questions related to autoimmune diseases.
Methods: Forty-six questions related to autoimmune diseases were posed to ChatGPT 3.5, ChatGPT 4.0, and Gemini. The responses were evaluated by rheumatologists on five quality dimensions: relevance, correctness, completeness, helpfulness, and safety. In parallel, laboratory specialists assessed the responses across six medical fields: concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis. Finally, the performance of the three chatbots was statistically analyzed and compared across the five quality dimensions and six medical fields.
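As a minimal sketch of how such expert ratings could be organized and summarized, the Python snippet below aggregates per-rating scores into the "mean ± SD" form used in the Results. The column names, the 1-5 rating scale, and the toy scores are assumptions for illustration only; the paper does not publish its analysis code or raw data.

```python
# Hypothetical aggregation of expert ratings; schema and scores are
# illustrative assumptions, not taken from the study.
import pandas as pd

# Each row: one expert's rating of one model's answer on one dimension.
ratings = pd.DataFrame({
    "model":     ["ChatGPT 3.5", "ChatGPT 4.0", "Gemini"] * 4,
    "dimension": ["relevance"] * 6 + ["correctness"] * 6,
    "score":     [4, 5, 4, 3, 5, 4, 4, 5, 3, 4, 5, 4],  # toy 1-5 ratings
})

# Mean and standard deviation per model within each dimension,
# mirroring the "mean ± SD" summaries reported in the Results.
summary = (ratings
           .groupby(["dimension", "model"])["score"]
           .agg(["mean", "std"])
           .round(2))
print(summary)
```

The same grouping could be repeated over the six medical fields by adding a `field` column, keeping one tidy table for all comparisons.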
Results: ChatGPT 4.0 outperformed both ChatGPT 3.5 and Gemini across all five quality dimensions, with an average score of 199.8 ± 10.4, significantly higher than ChatGPT 3.5 (175.7 ± 16.6) and Gemini (179.1 ± 11.8) (p = 0.009 and p = 0.001, respectively). The average performance differences between ChatGPT 3.5 and Gemini across these five dimensions were not statistically significant. Specifically, ChatGPT 4.0 outperformed ChatGPT 3.5 and Gemini, respectively, in relevance (p < 0.0001, p < 0.0001), completeness (p < 0.0001, p = 0.0006), correctness (p = 0.0001, p = 0.0002), helpfulness (p < 0.0001, p < 0.0001), and safety (p < 0.0001, p = 0.0025). Furthermore, ChatGPT 4.0 scored significantly higher than both ChatGPT 3.5 and Gemini in the medical fields of report interpretation (p < 0.0001, p = 0.0025), prevention and treatment (p < 0.0001, p = 0.0103), and prognosis (p = 0.0458, p = 0.0458).
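To make the pairwise comparisons concrete, here is a hedged sketch of how per-question total scores for the three models could be compared. The abstract does not name the statistical test used, so a two-sided Mann-Whitney U test is shown as one reasonable nonparametric choice; the data are simulated around the reported means and SDs purely to make the example runnable, not to reproduce the study's results.

```python
# Hypothetical pairwise comparison of per-question total scores;
# the study's actual test and raw scores are not specified in the abstract.
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Toy scores for 46 questions per model, drawn around the reported
# mean ± SD values solely so the sketch runs end to end.
totals = {
    "ChatGPT 3.5": rng.normal(175.7, 16.6, 46),
    "ChatGPT 4.0": rng.normal(199.8, 10.4, 46),
    "Gemini":      rng.normal(179.1, 11.8, 46),
}

# All three pairwise model comparisons, two-sided.
for a, b in combinations(totals, 2):
    stat, p = mannwhitneyu(totals[a], totals[b], alternative="two-sided")
    print(f"{a} vs {b}: U = {stat:.1f}, p = {p:.4f}")
```

With three pairwise tests per dimension, a multiple-comparison correction (e.g., Bonferroni) would typically be applied; whether the authors did so is not stated in the abstract.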
Conclusions: This study demonstrates that ChatGPT 4.0 significantly outperforms ChatGPT 3.5 and Gemini in addressing clinical questions related to autoimmune diseases, showing notable advantages across all five quality dimensions and six medical fields. These findings further highlight the potential of large language models in enhancing healthcare services.