Performance of large language models on advocating the management of meningitis: a comparative qualitative study

BMJ Health & Care Informatics (IF 4.1, Q1, Health Care Sciences & Services) | Pub Date: 2024-02-01 | DOI: 10.1136/bmjhci-2023-100978
Urs Fisch, Paulina Kliem, Pascale Grzonka, Raoul Sutter

Abstract

Objectives We aimed to examine the adherence of large language models (LLMs) to bacterial meningitis guidelines using a hypothetical medical case, highlighting their utility and limitations in healthcare.

Methods A simulated clinical scenario of a patient with bacterial meningitis secondary to mastoiditis was presented in three independent sessions to seven publicly accessible LLMs (Bard, Bing, Claude-2, GPT-3.5, GPT-4, Llama, PaLM). Responses were evaluated for adherence to good clinical practice and two international meningitis guidelines.

Results A central nervous system infection was identified in 90% of LLM sessions. All recommended imaging, while 81% suggested lumbar puncture. Blood cultures and specific mastoiditis work-up were proposed in only 62% and 38% of sessions, respectively. Only 38% of sessions provided the correct empirical antibiotic treatment, while antiviral treatment and dexamethasone were advised in 33% and 24%, respectively. Misleading statements were generated in 52%. No significant correlation was found between LLMs' text length and performance (r=0.29, p=0.20). Among all LLMs, GPT-4 demonstrated the best performance.

Discussion The latest LLMs provide valuable advice on differential diagnosis and diagnostic procedures but vary significantly in treatment-specific information for bacterial meningitis when presented with a realistic clinical scenario. Misleading statements were common, with performance differences attributed to each LLM's unique algorithm rather than output length.

Conclusions Users must be aware of such limitations and performance variability when considering LLMs as a support tool for medical decision-making. Further research is needed to refine these models' comprehension of complex medical scenarios and their ability to provide reliable information.

Data are available upon reasonable request.
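The reported figures can be sanity-checked with a little arithmetic. The sketch below is not the authors' analysis. It assumes that each of the seven LLMs was run in three sessions (21 sessions in total) and that the correlation was computed across those 21 sessions; the abstract does not state the sample size used for r. Under those assumptions it recovers the session count implied by each percentage and converts r=0.29 into a t statistic and a two-sided p-value. The helper names (`implied_count`, `t_pdf`, `two_sided_p`) are illustrative only.

```python
import math

# ASSUMPTION: 7 LLMs x 3 sessions each = 21 sessions (inferred from the Methods).
N_SESSIONS = 7 * 3

# Percentages as reported in the Results.
reported_pct = {
    "CNS infection identified": 90,
    "lumbar puncture suggested": 81,
    "blood cultures proposed": 62,
    "mastoiditis work-up proposed": 38,
    "correct empirical antibiotics": 38,
    "antiviral treatment advised": 33,
    "dexamethasone advised": 24,
    "misleading statements": 52,
}


def implied_count(pct, n=N_SESSIONS):
    """Return the integer k for which k/n rounds to pct percent, else None."""
    for k in range(n + 1):
        if round(100 * k / n) == pct:
            return k
    return None


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r, n, steps=2000):
    """t statistic and two-sided p-value for a Pearson r from n pairs.

    The tail area is obtained by integrating the t density over [0, t]
    with Simpson's rule (steps must be even).
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return t, 2 * (0.5 - area)


if __name__ == "__main__":
    for item, pct in reported_pct.items():
        print(f"{item}: {pct}% -> {implied_count(pct)} of {N_SESSIONS} sessions")
    t, p = two_sided_p(0.29, N_SESSIONS)
    print(f"t = {t:.2f}, two-sided p = {p:.2f}")
```

Every reported percentage maps to a whole number of sessions out of 21, and r=0.29 with 19 degrees of freedom gives a p-value close to the reported 0.20, so the abstract's numbers are at least internally consistent with the 21-session assumption.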
Source journal
CiteScore: 6.10
Self-citation rate: 4.90%
Articles published: 40
Review time: 18 weeks