Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review

Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Richard Yang, Wenyu Zhang, Xinyi Wang, Rui Zhang, Pengyu Hong, David W. Bates, Li Zhou
Journal: medRxiv (preprint) · DOI: 10.1101/2024.08.11.24311828 · Published: 2024-08-12 · Citations: 0

Abstract

Background: Generative large language models (LLMs) represent a significant advance in natural language processing, achieving state-of-the-art performance across a variety of tasks. However, their application in clinical settings using real electronic health records (EHRs) remains rare and presents numerous challenges.

Objective: This study aims to systematically review the use of generative LLMs for patient care-related topics involving EHRs, summarize the challenges faced, and suggest future directions.

Methods: A Boolean search for peer-reviewed articles was conducted in May 2024 using PubMed and Web of Science, covering research articles published since January 2023 (roughly one month after the release of ChatGPT). The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and extracted bibliometric and clinically relevant information. Only papers using generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings, as reported by the included papers, and proposed future directions.

Results: The initial search identified 6,328 unique studies, of which 76 were included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and the resulting performance improvement.
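The zero-shot and few-shot prompting strategies discussed above differ only in whether worked examples are included in the prompt. A minimal, illustrative sketch of that difference (the clinical task, the example notes, and the chat-message format are hypothetical, not drawn from any reviewed study):

```python
# Illustrative sketch of zero-shot vs. few-shot prompt construction for a
# hypothetical clinical extraction task. A real study would send these
# messages to an LLM API and evaluate the model's output; here we only
# build the message lists.

def build_prompt(note: str, examples=None):
    """Build a chat-style message list; zero-shot when `examples` is empty."""
    messages = [{"role": "system",
                 "content": "Extract the primary diagnosis from the clinical note."}]
    for note_ex, diagnosis_ex in examples or []:  # few-shot demonstrations
        messages.append({"role": "user", "content": note_ex})
        messages.append({"role": "assistant", "content": diagnosis_ex})
    messages.append({"role": "user", "content": note})
    return messages

zero_shot = build_prompt("Pt presents with polyuria and HbA1c of 9.1%.")
few_shot = build_prompt(
    "Pt presents with polyuria and HbA1c of 9.1%.",
    examples=[("Pt c/o chest pain radiating to left arm; troponin elevated.",
               "Acute myocardial infarction")],
)
print(len(zero_shot), len(few_shot))  # 2 4
```

The non-linear relationship reported by one reviewed study suggests that adding demonstrations beyond some point yields diminishing or even negative returns, so the number of examples is itself a parameter worth tuning experimentally.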
Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three noted potential performance degradation after fine-tuning on certain tasks. Only two studies used multimodal data, which improved LLM-based decision-making and enabled accurate rare-disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias: one detected no bias, while the other found that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricated patient names in structured thyroid ultrasound reports. Additional challenges included, but were not limited to, the impersonal tone of LLM consultations, which made patients uncomfortable, and patients' difficulty understanding LLM responses.

Conclusion: Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diversity of evaluation metrics used highlights the need for standardization. LLMs currently cannot replace physicians because of challenges such as bias, hallucinations, and impersonal responses.
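As a concrete illustration of two of the evaluation purposes named above, correctness and conciseness can be scored with simple reference-based formulas. The definitions below are common simplifications chosen for illustration; they are not the specific metrics used by any of the reviewed studies:

```python
# Simple reference-based scores for two evaluation purposes mentioned in the
# review: correctness (token overlap with a reference answer) and
# conciseness (brevity relative to the reference). Illustrative
# simplifications only, not the metrics of any particular study.

def correctness(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the candidate (recall)."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / len(ref) if ref else 0.0

def conciseness(candidate: str, reference: str) -> float:
    """1.0 when the candidate is no longer than the reference; decays as it grows."""
    c, r = len(candidate.split()), len(reference.split())
    return min(1.0, r / c) if c else 0.0

ref = "type 2 diabetes mellitus"
print(correctness("Diagnosis: type 2 diabetes mellitus", ref))  # 1.0
print(conciseness("Diagnosis: type 2 diabetes mellitus", ref))  # 0.8
```

With 55 distinct metrics across 76 studies, even such simple definitions vary from paper to paper, which is the standardization gap the review's conclusion points to.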