Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.

IF 5.7 3区 医学 Q1 HEALTH CARE SCIENCES & SERVICES Journal of Medical Systems Pub Date : 2025-03-25 DOI:10.1007/s10916-025-02170-7
Jinze Li, Chao Chang, Yanqiu Li, Shengyu Cui, Fan Yuan, Zhuojun Li, Xinyu Wang, Kang Li, Yuxin Feng, Zuowei Wang, Zhijian Wei, Fengzeng Jian
{"title":"Large Language Models' Responses to Spinal Cord Injury: A Comparative Study of Performance.","authors":"Jinze Li, Chao Chang, Yanqiu Li, Shengyu Cui, Fan Yuan, Zhuojun Li, Xinyu Wang, Kang Li, Yuxin Feng, Zuowei Wang, Zhijian Wei, Fengzeng Jian","doi":"10.1007/s10916-025-02170-7","DOIUrl":null,"url":null,"abstract":"<p><p>With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs-ChatGPT-4o, Claude-3.5 sonnet, Gemini-1.5 Pro, and Llama-3.1-to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring a college-level reading comprehension ability, they were all able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving significantly higher \"Good\" ratings (83.8%) compared to Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"39"},"PeriodicalIF":5.7000,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Systems","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s10916-025-02170-7","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

With the increasing application of large language models (LLMs) in the medical field, their potential in patient education and clinical decision support is becoming increasingly prominent. Given the complex pathogenesis, diverse treatment options, and lengthy rehabilitation periods of spinal cord injury (SCI), patients are increasingly turning to advanced online resources to obtain relevant medical information. This study analyzed responses from four LLMs-ChatGPT-4o, Claude-3.5 sonnet, Gemini-1.5 Pro, and Llama-3.1-to 37 SCI-related questions spanning pathogenesis, risk factors, clinical features, diagnostics, treatments, and prognosis. Quality and readability were assessed using the Ensuring Quality Information for Patients (EQIP) tool and Flesch-Kincaid metrics, respectively. Accuracy was independently scored by three senior spine surgeons using consensus scoring. Performance varied among the models. Gemini ranked highest in EQIP scores, suggesting superior information quality. Although the readability of all four LLMs was generally low, requiring a college-level reading comprehension ability, they were all able to effectively simplify complex content. Notably, ChatGPT led in accuracy, achieving significantly higher "Good" ratings (83.8%) compared to Claude (78.4%), Gemini (54.1%), and Llama (62.2%). Comprehensiveness scores were high across all models. Furthermore, the LLMs exhibited strong self-correction abilities. After being prompted for revision, the accuracy of ChatGPT and Claude's responses improved by 100% and 50%, respectively; both Gemini and Llama improved by 67%. This study represents the first systematic comparison of leading LLMs in the context of SCI. While Gemini excelled in response quality, ChatGPT provided the most accurate and comprehensive responses.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
大型语言模型对脊髓损伤的反应:性能的比较研究。
随着大语言模型(llm)在医学领域的应用越来越多,其在患者教育和临床决策支持方面的潜力日益突出。由于脊髓损伤的发病机制复杂、治疗方案多样、康复期长,越来越多的患者通过先进的网络资源获取相关的医学信息。本研究分析了4个llms (chatgpt - 40、Claude-3.5 sonnet、Gemini-1.5 Pro和llama -3.1)对37个sci相关问题的回答,涉及发病机制、危险因素、临床特征、诊断、治疗和预后。质量和可读性分别使用确保患者质量信息(EQIP)工具和Flesch-Kincaid指标进行评估。准确性由三位资深脊柱外科医生采用共识评分法独立评分。各型号的性能各不相同。双子座在EQIP得分中排名最高,表明信息质量更高。虽然这四个llm的可读性普遍较低,需要大学水平的阅读理解能力,但他们都能够有效地简化复杂的内容。值得注意的是,ChatGPT在准确率上领先,获得了明显更高的“好”评级(83.8%),而Claude (78.4%), Gemini(54.1%)和Llama(62.2%)。所有模型的综合得分都很高。此外,法学硕士表现出较强的自我纠正能力。在提示修改后,ChatGPT和Claude的回答准确率分别提高了100%和50%;双子和羊驼都提高了67%。本研究首次对SCI背景下的领先法学硕士进行了系统比较。Gemini在回应质量上更胜一筹,ChatGPT则提供了最准确、最全面的回应。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Medical Systems
Journal of Medical Systems 医学-卫生保健
CiteScore
11.60
自引率
1.90%
发文量
83
审稿时长
4.8 months
期刊介绍: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital clinic and physician''s office administration; pathology radiology and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles essays and studies across the entire scale of medical systems from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems the journal includes a special section devoted to status reports on current installations.
期刊最新文献
Clinical Use of Non-Certified Generative AI in Healthcare: Governing the Regulatory Grey Zone from Convenience to Legal Accountability. Protecting Peer Review from the Surge of Low Fidelity Systematic Reviews in the Generative AI Era. Tracking Greenhouse Gas Emission Initiatives Across a Large Academic Health System Utilizing Innovative Dashboards. The Impact of Healthcare Professionals' Characteristics on the Evaluation of Clinical Decision Support Systems: Insights from a Cross-Country Usability and Technology Acceptance Study of the iCARE Tool. Emerging Utility of Multimodal Large Language Models in Cardiovascular Diagnostics.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1