Benchmarking Large Language Models in Evidence-Based Medicine

IEEE Journal of Biomedical and Health Informatics (IF 7.7; Zone 2, Medicine; Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS) Pub Date: 2024-10-21 DOI: 10.1109/JBHI.2024.3483816

Jin Li;Yiyan Deng;Qi Sun;Junjie Zhu;Yu Tian;Jingsong Li;Tingting Zhu

Volume 29, issue 9, pages 6143-6156. Publication type: Journal Article. Full record: https://ieeexplore.ieee.org/document/10723298/

Citations: 0

Abstract

Evidence-based medicine (EBM) represents a paradigm of providing patient care grounded in the most current and rigorously evaluated research. Recent advances in large language models (LLMs) offer a potential solution to transform EBM by automating labor-intensive tasks and thereby improving the efficiency of clinical decision-making. This study explores integrating LLMs into the key stages of EBM, evaluating their ability across evidence retrieval (PICO extraction, biomedical question answering), synthesis (summarizing randomized controlled trials), and dissemination (medical text simplification). We conducted a comparative analysis of seven LLMs, including both proprietary and open-source models, as well as those fine-tuned on medical corpora. Specifically, we benchmarked the performance of various LLMs on each EBM task under zero-shot settings as baselines, and employed prompting techniques, including in-context learning, chain-of-thought reasoning, and knowledge-guided prompting, to enhance their capabilities. Our extensive experiments revealed the strengths of LLMs, such as remarkable understanding capabilities even in zero-shot settings, strong summarization skills, and effective knowledge transfer via prompting. Prompting strategies such as knowledge-guided prompting proved highly effective (e.g., improving the performance of GPT-4 by 13.10% over zero-shot in PICO extraction). However, the experiments also showed limitations, with LLM performance falling well below state-of-the-art baselines like PubMedBERT in handling named entity recognition tasks. Moreover, human evaluation revealed persisting challenges with factual inconsistencies and domain inaccuracies, underscoring the need for rigorous quality control before clinical application. This study provides insights into enhancing EBM using LLMs while highlighting critical areas for further research. The code is publicly available on GitHub.
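To make the prompting settings named in the abstract more concrete, the following is a minimal Python sketch of how zero-shot, in-context, and chain-of-thought prompts for PICO extraction might be assembled, together with an exact-match F1 score for comparing predicted and gold spans. Everything here is an illustrative assumption: the function names `build_prompt` and `span_f1`, the instruction wording, and the scoring rule are hypothetical and are not taken from the paper or its released code. Knowledge-guided prompting is left out because the abstract does not describe how it is constructed.

```python
from typing import Iterable, Sequence, Tuple

Span = Tuple[str, str]  # (PICO label, extracted text); hypothetical representation


def build_prompt(sentence: str,
                 strategy: str = "zero-shot",
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a PICO-extraction prompt for one prompting strategy.

    The instruction text is an assumption made for illustration only.
    """
    parts = [
        "Extract the Population, Intervention, Comparator and Outcome "
        "spans from the trial sentence below."
    ]
    if strategy == "in-context":
        # In-context learning: prepend worked (sentence, answer) demonstrations.
        for ex_sentence, ex_answer in examples:
            parts.append(f"Sentence: {ex_sentence}\nAnswer: {ex_answer}")
    elif strategy == "chain-of-thought":
        # Chain-of-thought: ask for intermediate reasoning before the answer.
        parts.append("Think step by step before giving the final answer.")
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy!r}")
    parts.append(f"Sentence: {sentence}\nAnswer:")
    return "\n\n".join(parts)


def span_f1(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Exact-match F1 between predicted and gold (label, text) spans."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    demo = [("Adults with asthma received budesonide.",
             "Population: adults with asthma; Intervention: budesonide")]
    print(build_prompt("Patients took aspirin daily.", "in-context", demo))
    print(span_f1({("Intervention", "aspirin"), ("Outcome", "stroke")},
                  {("Intervention", "aspirin"), ("Outcome", "mortality")}))
```

Scoring each strategy's output with the same metric is what allows a zero-shot baseline to be compared against the prompted variants; the exact-match rule used here is only one possible choice and may differ from the metric reported in the study.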


Source journal

IEEE Journal of Biomedical and Health Informatics (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore: 13.60
Self-citation rate: 6.50%
Articles published: 1151

About the journal: IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.