Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.

IF 5.7 · Zone 3 (Medicine) · JCR Q1, HEALTH CARE SCIENCES & SERVICES · Journal of Medical Systems · Pub Date: 2025-03-25 · DOI: 10.1007/s10916-025-02170-7

Jinze Li, Chao Chang, Yanqiu Li, Shengyu Cui, Fan Yuan, Zhuojun Li, Xinyu Wang, Kang Li, Yuxin Feng, Zuowei Wang, Zhijian Wei, Fengzeng Jian

Journal of Medical Systems, volume 49, issue 1, article 39. Publication type: Journal Article.

Citations: 0

Abstract

As large language models (LLMs) are increasingly applied in medicine, their potential in patient education and clinical decision support is becoming more prominent. Because spinal cord injury (SCI) involves complex pathogenesis, diverse treatment options, and lengthy rehabilitation, patients are increasingly turning to advanced online resources for medical information. This study analyzed responses from four LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1) to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring college-level reading comprehension, all of them were able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving a significantly higher proportion of "Good" ratings (83.8%) than Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT's and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.
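The abstract names Flesch-Kincaid metrics as the readability measure but does not say which variant or software was used. As a minimal sketch, assuming the two standard published formulas (Flesch Reading Ease and Flesch-Kincaid Grade Level), the scores could be computed roughly as follows. The function names and the vowel-group syllable heuristic are illustrative assumptions, not the authors' method; dedicated readability tools count syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate (an assumption for illustration):
    count runs of vowels and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


if __name__ == "__main__":
    sample = "Spinal cord injury rehabilitation requires multidisciplinary coordination."
    print(round(flesch_reading_ease(sample), 1))
    print(round(flesch_kincaid_grade(sample), 1))
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences full of polysyllabic clinical terms push a response toward a college-level grade.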


Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore: 11.60

Self-citation rate: 1.90%

Articles published: not reported

Review time: 4.8 months

Journal description: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.