Benchmarking Large Language Models in Evidence-Based Medicine.

IF 6.7 | CAS Tier 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | IEEE Journal of Biomedical and Health Informatics | Pub Date: 2024-10-21 | DOI: 10.1109/JBHI.2024.3483816
Jin Li, Yiyan Deng, Qi Sun, Junjie Zhu, Yu Tian, Jingsong Li, Tingting Zhu
{"title":"Benchmarking Large Language Models in Evidence-Based Medicine.","authors":"Jin Li, Yiyan Deng, Qi Sun, Junjie Zhu, Yu Tian, Jingsong Li, Tingting Zhu","doi":"10.1109/JBHI.2024.3483816","DOIUrl":null,"url":null,"abstract":"<p><p>Evidence-based medicine (EBM) represents a paradigm of providing patient care grounded in the most current and rigorously evaluated research. Recent advances in large language models (LLMs) offer a potential solution to transform EBM by automating labor-intensive tasks and thereby improving the efficiency of clinical decision-making. This study explores integrating LLMs into the key stages in EBM, evaluating their ability across evidence retrieval (PICO extraction, biomedical question answering), synthesis (summarizing randomized controlled trials), and dissemination (medical text simplification). We conducted a comparative analysis of seven LLMs, including both proprietary and open-source models, as well as those fine-tuned on medical corpora. Specifically, we benchmarked the performance of various LLMs on each EBM task under zero-shot settings as baselines, and employed prompting techniques, including in-context learning, chain-of-thought reasoning, and knowledge-guided prompting to enhance their capabilities. Our extensive experiments revealed the strengths of LLMs, such as remarkable understanding capabilities even in zero-shot settings, strong summarization skills, and effective knowledge transfer via prompting. Promoting strategies such as knowledge-guided prompting proved highly effective (e.g., improving the performance of GPT-4 by 13.10% over zero-shot in PICO extraction). However, the experiments also showed limitations, with LLM performance falling well below state-of-the-art baselines like PubMedBERT in handling named entity recognition tasks. Moreover, human evaluation revealed persisting challenges with factual inconsistencies and domain inaccuracies, underscoring the need for rigorous quality control before clinical application. This study provides insights into enhancing EBM using LLMs while highlighting critical areas for further research. The code is publicly available on Github.</p>","PeriodicalId":13073,"journal":{"name":"IEEE Journal of Biomedical and Health Informatics","volume":"PP ","pages":""},"PeriodicalIF":6.7000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Biomedical and Health Informatics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/JBHI.2024.3483816","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Evidence-based medicine (EBM) represents a paradigm of providing patient care grounded in the most current and rigorously evaluated research. Recent advances in large language models (LLMs) offer a potential solution to transform EBM by automating labor-intensive tasks and thereby improving the efficiency of clinical decision-making. This study explores integrating LLMs into the key stages of EBM, evaluating their ability across evidence retrieval (PICO extraction, biomedical question answering), synthesis (summarizing randomized controlled trials), and dissemination (medical text simplification). We conducted a comparative analysis of seven LLMs, including both proprietary and open-source models, as well as those fine-tuned on medical corpora. Specifically, we benchmarked the performance of various LLMs on each EBM task under zero-shot settings as baselines, and employed prompting techniques, including in-context learning, chain-of-thought reasoning, and knowledge-guided prompting, to enhance their capabilities. Our extensive experiments revealed the strengths of LLMs, such as remarkable understanding capabilities even in zero-shot settings, strong summarization skills, and effective knowledge transfer via prompting. Prompting strategies such as knowledge-guided prompting proved highly effective (e.g., improving the performance of GPT-4 by 13.10% over zero-shot in PICO extraction). However, the experiments also showed limitations, with LLM performance falling well below state-of-the-art baselines such as PubMedBERT in handling named entity recognition tasks. Moreover, human evaluation revealed persistent challenges with factual inconsistencies and domain inaccuracies, underscoring the need for rigorous quality control before clinical application. This study provides insights into enhancing EBM using LLMs while highlighting critical areas for further research. The code is publicly available on GitHub.
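The abstract contrasts a zero-shot baseline with knowledge-guided prompting but does not reproduce the prompts themselves. Purely as an illustrative sketch of those two regimes for PICO extraction, the snippet below builds both prompt variants and collects model outputs for later scoring; the `query_llm` callable, the prompt wording, and the sample trial abstract are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the two prompting regimes compared in the paper:
# zero-shot vs. knowledge-guided prompting for PICO extraction.
# `query_llm` stands in for whatever LLM client is used; all wording is illustrative.

from typing import Callable

PICO_DEFINITIONS = (
    "PICO elements:\n"
    "- Population: the patients or participants studied\n"
    "- Intervention: the treatment or exposure being evaluated\n"
    "- Comparator: the alternative the intervention is compared against\n"
    "- Outcome: the results that are measured\n"
)

def zero_shot_prompt(abstract: str) -> str:
    """Baseline prompt: task instruction only, no added domain knowledge."""
    return (
        "Extract the PICO elements from the following trial abstract. "
        "Return them as 'Population:', 'Intervention:', 'Comparator:', and 'Outcome:' lines.\n\n"
        f"Abstract: {abstract}"
    )

def knowledge_guided_prompt(abstract: str) -> str:
    """Knowledge-guided prompt: prepend PICO definitions to steer the extraction."""
    return PICO_DEFINITIONS + "\n" + zero_shot_prompt(abstract)

def run_benchmark(abstracts: list[str], query_llm: Callable[[str], str]) -> dict[str, list[str]]:
    """Collect outputs under both settings so they can be scored against gold annotations."""
    results: dict[str, list[str]] = {"zero_shot": [], "knowledge_guided": []}
    for abstract in abstracts:
        results["zero_shot"].append(query_llm(zero_shot_prompt(abstract)))
        results["knowledge_guided"].append(query_llm(knowledge_guided_prompt(abstract)))
    return results

if __name__ == "__main__":
    sample = (
        "In this randomized trial, 200 adults with type 2 diabetes received metformin "
        "or placebo for 24 weeks; the primary outcome was change in HbA1c."
    )
    # Stub model for demonstration; replace with a real LLM call.
    echo_model = lambda prompt: f"[model output for prompt of {len(prompt)} chars]"
    outputs = run_benchmark([sample], echo_model)
    for setting, outs in outputs.items():
        print(setting, "->", outs[0])
```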

Source journal: IEEE Journal of Biomedical and Health Informatics
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
CiteScore: 13.60
Self-citation rate: 6.50%
Publication volume: 1151
Journal description: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.
Latest articles from this journal:
Machine Learning Identification and Classification of Mitosis and Migration of Cancer Cells in a Lab-on-CMOS Capacitance Sensing platform.
Biomedical Information Integration via Adaptive Large Language Model Construction.
BloodPatrol: Revolutionizing Blood Cancer Diagnosis - Advanced Real-Time Detection Leveraging Deep Learning & Cloud Technologies.
EEG Detection and Prediction of Freezing of Gait in Parkinson's Disease Based on Spatiotemporal Coherent Modes.
Functional Data Analysis of Hand Rotation for Open Surgical Suturing Skill Assessment.