Journal of the American Medical Informatics Association: Latest Articles

The impact of artificial intelligence scribes on physician and advanced practice provider cognitive load and well-being.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-21 | DOI: 10.1093/jamia/ocag005
Kathryn R Schneider, Hillary E Swann-Thomsen, Terry G Ribbens, Lucas A Bahnmaier, Trevor Satterfield, Reme Pullicar, Neeraj Soni

Background and significance: Physician and advanced practice provider (APP) well-being is a critical focus in healthcare. Emerging technology such as generative artificial intelligence (GAI) scribes reduces physician and APP administrative burden created by electronic health records. Early adopters of this technology have demonstrated promising improvements in clinical documentation, well-being, and cognitive load. However, further exploration across professional roles is warranted.

Objective: The goal of this quality improvement initiative was to explore how GAI scribes impacted well-being, cognitive load, and practice efficiency among physicians and APPs across professional roles.

Methods: A cross-sectional anonymous survey was conducted prior to implementation of GAI scribe technology and 3 months after physicians and APPs were onboarded.

Results: Physicians and APPs showed a reduction in cognitive task load following scribe technology implementation. Physicians reported reduced burnout and intent to leave; however, APPs did not have a significant reduction in burnout or intent to leave.

Conclusion: Artificial intelligence scribe technology shows potential for improving well-being among physicians and APPs by reducing cognitive load and clinical documentation time. Although some differences were found, overall, the technology appears to hold promise across professional roles.

Citations: 0
From use cases to infrastructure: a cross-institutional survey of priorities in data-driven biomedical research.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-20 | DOI: 10.1093/jamia/ocag001
Raja Mazumder, Jonathon Keeney, Luke Johnson, Lori Krammer, Patrick McNeely, Jorge Sepulveda, Danielle Hangen, Maria Martin, Dushyanth Jyothi, Jonas De Almeida, Peter McGarvey, Adil Alaoui, Sarah Cha, Art Sedrakyan, Evan Shoelle, Michael Matheny, Michele LeNoue-Newton, Robert Winter, Stephen Deppen, Vahan Simonyan, Anelia Horvath

Objectives: Federated Ecosystems for Analytics and Standardized Technologies (FEAST) is a modular, cloud-based platform developed through the ARPA-H Biomedical Data Fabric initiative to enable secure, federated analysis of real-world biomedical data. To guide and iteratively refine its modular design, the FEAST team conducted a cross-institutional survey to systematically identify and prioritize research needs related to authorized-access data across diverse biomedical domains. This study presents a structured synthesis of submitted use cases to uncover infrastructure gaps, data integration challenges, and translational opportunities. The results from the survey inform both front-end user-facing functionality and backend data requirements, shaping how the interface supports user interactions, data types, and compliance with security and interoperability standards.

Materials and methods: A structured survey form was distributed to researchers affiliated with participating institutions, including DNA-HIVE, The George Washington University (GW-FEAST), Weill Cornell Medicine, Vanderbilt University Medical Center, Georgetown University, European Bioinformatics Institute, and Kaiser Permanente. Respondents completed standardized fields describing the data types of interest, project goals, analytic methods, and perceived technical barriers. The collected responses were curated and analyzed to identify common needs related to privacy, interoperability, scalability, and workflow reproducibility.

Results: The survey compiled 61 use cases spanning genomics, imaging, clinical phenotyping, EHR-driven analytics, and precision medicine. Common themes included the need for multi-modal data integration, HL7 FHIR-based secure access, federated model training without PII retention, and containerized microservices for scalable deployment. Convergent needs across institutions emphasized consistent demand for FAIR-compliant infrastructure and readiness for real-world data analytics.

Conclusion: The FEAST Use Cases survey provides a cross-sectional view of biomedical informatics priorities grounded in real-world data needs. The findings offer a strategic blueprint for developing federated, privacy-preserving infrastructure to support secure, collaborative, and scalable biomedical research.

Citations: 0
Positive act of reporting negative results in large language model research: a call for transparency.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-19 | DOI: 10.1093/jamia/ocaf221
Satvik Tripathi, Dana Alkhulaifat, Tessa S Cook

Purpose: To highlight the importance of reporting negative results in large language model (LLM) research, particularly as these systems are increasingly integrated into healthcare.

Potential: LLMs offer transformative capabilities in text generation, summarization, and clinical decision support. Transparent documentation of both successes and failures can accelerate innovation, improve reproducibility, and guide safe deployment.

Caution: Publication bias toward positive findings conceals model limitations, biases, and reproducibility challenges. In healthcare, underreporting failures risks patient safety, ethical lapses, and wasted resources. Structural barriers, including a lack of standards and limited funding for failure analysis, perpetuate this cycle.

Conclusions: Negative results should be recognized as valuable contributions that delineate the boundaries of LLM applicability. Structured reporting, educational initiatives, and stronger incentives for transparency are essential to ensure responsible, equitable, and trustworthy use of LLMs in healthcare.

Citations: 0
Contextualizing key principles to promote a justice-oriented informatics research agenda: proceedings and reflections from an American Medical Informatics Association workshop.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-19 | DOI: 10.1093/jamia/ocaf210
Aparajita Kashyap, Christopher J Allsman, Elizabeth A Campbell, Pooja M Desai, Salvatore G Volpe, Bria P Massey, Tiffani J Bright, Suzanne Bakken, Oliver J Bear Don't Walk IV, Adrienne Pichon

Objectives: Advancing health through informatics requires attending to justice. Recent policy changes in the United States have introduced significant barriers to promoting justice within informatics due to targeted funding cuts and hostility to science, especially science that prioritizes justice.

Materials and methods: We present five key principles for advancing a justice-oriented informatics agenda, synthesized from our workshop held at the American Medical Informatics Association 2022 Annual Symposium.

Results: These principles are: (1) Recognize knowledge and methodologies across communities; (2) Acknowledge historical and cultural contexts of interactions; (3) Facilitate transparency and accountability through clear measures and metrics; (4) Foster trust and sustainability; and (5) Equitably allocate compensation and resources.

Discussion and conclusion: We discuss barriers to implementing these principles that have arisen since the 2022 workshop and provide recommendations for moving towards justice-oriented informatics. We offer examples of how these principles may be used to frame challenges and adapt to new barriers within biomedical informatics (BMI).

Citations: 0
Testing and evaluation of generative large language models in electronic health record applications: a systematic review.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-13 | DOI: 10.1093/jamia/ocaf233
Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou

Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.

Methods: We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.

Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
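
The proportions reported above can be sanity-checked with a little arithmetic. The sketch below assumes that every percentage uses the 196 included studies as its denominator; the derived counts are approximations inferred from the percentages, not figures stated in the review, and the dictionary keys are illustrative labels.

```python
# Back-of-the-envelope check of the proportions in the Results paragraph.
# Assumption: each percentage is a share of the 196 included studies.
# The counts printed here are inferred, not reported by the review.

retrieved = 18_735
included = 196

reported_percentages = {
    "radiology": 26.0,
    "oncology": 10.7,
    "emergency medicine": 6.6,
    "clinical decision support": 62.2,
    "summarization": 5.6,
    "patient communication": 5.1,
    "GPT-4": 60.2,
    "GPT-3.5": 57.7,
}


def approximate_count(percent: float, total: int) -> int:
    """Convert a reported percentage back to the nearest whole study count."""
    return round(percent / 100 * total)


screening_yield = included / retrieved
print(f"screening yield: {screening_yield:.2%}")

for label, percent in reported_percentages.items():
    count = approximate_count(percent, included)
    print(f"{label}: about {count} of {included}")
```

Because the GPT-4 and GPT-3.5 shares sum to more than 100%, at least some studies must have evaluated both models.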

Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.

Citations: 0
Digital interdependence: impact of work spillover during clinical team handoffs.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-13 | DOI: 10.1093/jamia/ocaf212
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson

Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.

Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.

Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).
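
To put the size of that last effect in context, the arithmetic can be sketched as follows. This assumes the 1 percentage point figure is an absolute difference added to the 4.4% baseline prevalence; the abstract does not spell out this reading, and the function and variable names are illustrative.

```python
# Worked arithmetic for the overnight-decline figures quoted above.
# Assumption: "1 percentage point" is an absolute increase over the
# 4.4% baseline prevalence. Names are illustrative only.

baseline_risk = 0.044      # baseline prevalence of overnight decline
absolute_increase = 0.01   # 1 percentage point


def risk_summary(baseline: float, increase: float) -> dict:
    """Return the elevated risk, the risk ratio, and the relative increase."""
    elevated = baseline + increase
    ratio = elevated / baseline
    return {
        "elevated_risk": elevated,
        "risk_ratio": ratio,
        "relative_increase": ratio - 1.0,
    }


summary = risk_summary(baseline_risk, absolute_increase)
print(f"elevated risk: {summary['elevated_risk']:.1%}")
print(f"risk ratio: {summary['risk_ratio']:.2f}")
print(f"relative increase: {summary['relative_increase']:.0%}")
```

Under this reading, risk in the outer terciles is about 5.4% against 4.4%, a relative increase of roughly 23%.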

Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.

Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.

Citations: 0
Measuring the accuracy of electronic health record-based phenotyping in the All of Us Research Program to optimize statistical power for genetic association testing.
IF 4.6 | Zone 2, Medicine | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-13 | DOI: 10.1093/jamia/ocaf234
John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah

Objective: Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.

Materials and methods: We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.
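
As a rough illustration of why outcome misclassification erodes power, the sketch below treats labeled cases as a mixture of true cases and misclassified non-cases. It is not the authors' procedure: the mixture model, the normal-approximation two-proportion test, the function names (`observed_case_frequency`, `estimate_ppv`, `approx_power`), and all numbers are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def observed_case_frequency(ppv, p_true_cases, p_controls):
    """Carrier frequency among labeled cases when only a fraction `ppv`
    of them are true cases. Assumes (hypothetically) that misclassified
    cases carry the variant at the control frequency."""
    return ppv * p_true_cases + (1.0 - ppv) * p_controls


def estimate_ppv(p_observed, p_true_cases, p_controls):
    """Invert the mixture above to recover the implied PPV."""
    return (p_observed - p_controls) / (p_true_cases - p_controls)


def approx_power(ppv, p_true_cases, p_controls, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation, dominant tail only)."""
    p_obs = observed_case_frequency(ppv, p_true_cases, p_controls)
    se = sqrt(p_obs * (1 - p_obs) / n_cases
              + p_controls * (1 - p_controls) / n_controls)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_obs - p_controls) / se - z_crit)


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the study.
    for ppv in (1.0, 0.8, 0.5):
        print(ppv, round(approx_power(ppv, 0.05, 0.01, 500, 5000), 3))
```

Under these assumptions, lowering the PPV pulls the observed carrier frequency in cases toward the control frequency, which shrinks the detectable difference and therefore the power.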

Results: We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.

Discussion: EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.

Citations: 0
Digital health literacy as mediator between language preference and telehealth use among Latinos in the United States.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf232
Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas

Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.

Citations: 0
Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf235
Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker

Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work exploring ML suppression (silencing predictions based on auditing the ML) shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a desideratum for auditing the ML.

Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
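
The fairness metric named above can be sketched as follows. This is a minimal sketch that assumes "absolute averaged odds difference" means the absolute value of the mean of the between-group differences in false positive rate and true positive rate. The abstract does not spell out the formula, and a common variant instead averages the absolute differences. The function names and the two-group 0/1 coding are illustrative assumptions.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def abs_average_odds_difference(y_true, y_pred, group):
    """|0.5 * ((FPR_1 - FPR_0) + (TPR_1 - TPR_0))| for groups coded 0 and 1."""
    rates = {}
    for g in (0, 1):
        yt = [t for t, m in zip(y_true, group) if m == g]
        yp = [p for p, m in zip(y_pred, group) if m == g]
        rates[g] = tpr_fpr(yt, yp)
    tpr_diff = rates[1][0] - rates[0][0]
    fpr_diff = rates[1][1] - rates[0][1]
    return abs(0.5 * (fpr_diff + tpr_diff))


if __name__ == "__main__":
    # Toy data: group 1 receives more positive predictions than group 0.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
    group = [0, 0, 0, 0, 1, 1, 1, 1]
    print(abs_average_odds_difference(y_true, y_pred, group))
```

A value of 0 indicates equal error rates across the two groups under this definition; larger values indicate a larger disparity.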

Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.

Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.
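
The suppression idea can be sketched in a few lines. This is a simplified illustration, not the study's auditor model: it assumes a hypothetical auditor that silences the ML prediction whenever an uncertainty score exceeds a threshold, and a clinician who either accepts or ignores any prediction that is shown. All names and the default threshold are assumptions.

```python
def auditor_suppresses(uncertainty, threshold):
    """Hypothetical auditor rule: hide the ML prediction when it is too uncertain."""
    return uncertainty > threshold


def collaborative_decision(human, ml, uncertainty, accepts_ml, threshold=0.3):
    """Final decision for one case.

    human      -- the clinician's own decision (0 or 1)
    ml         -- the ML prediction (0 or 1)
    accepts_ml -- whether the clinician would follow a displayed prediction
    """
    if auditor_suppresses(uncertainty, threshold):
        # Prediction is silenced, so the clinician decides alone.
        return human
    return ml if accepts_ml else human


if __name__ == "__main__":
    print(collaborative_decision(human=0, ml=1, uncertainty=0.9, accepts_ml=True))
    print(collaborative_decision(human=0, ml=1, uncertainty=0.1, accepts_ml=True))
```

In this toy setup, overreliance can only affect cases the auditor lets through, which is the mechanism by which suppression is meant to limit the harm of poor predictions.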

Citations: 0
Structural insights into clinical large language models and their barriers to translational readiness.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-11 DOI: 10.1093/jamia/ocaf230
Jiwon You, Hangsik Shin

Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.

Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.

Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.

Results: Studies often addressed multiple early-stage clinical tasks (question answering, 56.2%; knowledge structuring, 31.5%; disease prediction, 43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.

Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.

Citations: 0