Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.
Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified the clinical specialties and tasks addressed in each included article and summarized the evaluation methods used.
Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarization and patient communication made up the smallest proportions, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
{"title":"Testing and evaluation of generative large language models in electronic health record applications: a systematic review.","authors":"Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou","doi":"10.1093/jamia/ocaf233","DOIUrl":"https://doi.org/10.1093/jamia/ocaf233","url":null,"abstract":"<p><strong>Background: </strong>The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.</p><p><strong>Methods: </strong>We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.</p><p><strong>Results: </strong>Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.</p><p><strong>Conclusion: </strong>Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson
Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.
Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.
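The mixed-effects specification described above can be sketched compactly. The following is a minimal sketch, not the authors' code: it fits nightshift EHR time against dayshift behavior terciles with a random intercept for the daytime physician, using statsmodels. All column names (night_ehr_min, day_ehr_tercile, day_handoff_tercile, day_md_id) and the file name are hypothetical placeholders for the patient-day observation table.

# Minimal sketch (assumptions noted above; not the authors' implementation).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("patient_days.csv")  # hypothetical patient-day observation file

# Nightshift EHR time modeled as a function of the preceding dayshift team's
# behavior (tercile indicators), with a random intercept per daytime physician.
model = smf.mixedlm(
    "night_ehr_min ~ C(day_ehr_tercile) + C(day_handoff_tercile)",
    data=df,
    groups=df["day_md_id"],
)
print(model.fit().summary())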
Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), the nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).
Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.
Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.
{"title":"Digital interdependence: impact of work spillover during clinical team handoffs.","authors":"Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson","doi":"10.1093/jamia/ocaf212","DOIUrl":"https://doi.org/10.1093/jamia/ocaf212","url":null,"abstract":"<p><strong>Objective: </strong>To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.</p><p><strong>Materials and methods: </strong>Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.</p><p><strong>Results: </strong>We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated. (ie, Busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).</p><p><strong>Discussion: </strong>We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.</p><p><strong>Conclusion: </strong>Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah
Objective: Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate the accuracy of three phenotype definitions for ovarian, female breast, and colorectal cancer in All of Us (v7 release) and to determine which is most likely to optimize downstream statistical power for genetic association testing.
Materials and methods: We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.
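To make the misclassification-power link concrete, here is an illustrative sketch, not the authors' analysis: it treats phenotype accuracy as a case positive predictive value (PPV) that dilutes the observed carrier frequency among EHR-defined cases, then recomputes power for a two-proportion comparison of carriers in cases versus controls. All carrier frequencies, sample sizes, and the significance threshold below are invented placeholders.

# Illustrative sketch of power loss from outcome misclassification (all numbers are placeholders).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_carrier_case, p_carrier_ctrl = 0.02, 0.005   # assumed true carrier frequencies
n_cases, n_ctrls = 2_000, 20_000               # hypothetical sample sizes

def power_at_ppv(ppv: float) -> float:
    # Observed "cases" are a mix of true cases and misclassified controls, so the
    # observed case carrier frequency is attenuated toward the control frequency.
    p_obs_case = ppv * p_carrier_case + (1 - ppv) * p_carrier_ctrl
    es = proportion_effectsize(p_obs_case, p_carrier_ctrl)
    return NormalIndPower().power(
        effect_size=es, nobs1=n_cases, ratio=n_ctrls / n_cases, alpha=5e-8
    )

for ppv in (1.0, 0.9, 0.7, 0.5):
    print(f"PPV={ppv:.1f} -> power={power_at_ppv(ppv):.3f}")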
Results: We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.
Discussion: EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.
{"title":"Measuring the accuracy of electronic health record-based phenotyping in the All of Us Research Program to optimize statistical power for genetic association testing.","authors":"John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah","doi":"10.1093/jamia/ocaf234","DOIUrl":"https://doi.org/10.1093/jamia/ocaf234","url":null,"abstract":"<p><strong>Objective: </strong>Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.</p><p><strong>Materials and methods: </strong>We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.</p><p><strong>Results: </strong>We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.</p><p><strong>Discussion: </strong>EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas
Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.
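As a rough illustration of how such a mediation proportion can be approximated, the sketch below applies the simple difference method, comparing the language-preference coefficient with and without adjustment for digital health literacy. This is not the authors' analysis: the column names and file are hypothetical, and the actual study would account for NHIS survey design and use a formal mediation framework.

# Illustrative difference-method sketch only; not the published analysis.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nhis_latino_adults.csv")  # hypothetical analytic file

# Total effect: language preference -> telehealth use, adjusted for covariates.
total = smf.logit("telehealth_use ~ non_english_pref + age + female", data=df).fit()
# "Direct" effect: same model, additionally adjusted for digital health literacy.
direct = smf.logit("telehealth_use ~ non_english_pref + dhl_score + age + female", data=df).fit()

b_total = total.params["non_english_pref"]
b_direct = direct.params["non_english_pref"]
print("Approximate proportion mediated:", (b_total - b_direct) / b_total)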
{"title":"Digital health literacy as mediator between language preference and telehealth use among Latinos in the United States.","authors":"Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas","doi":"10.1093/jamia/ocaf232","DOIUrl":"https://doi.org/10.1093/jamia/ocaf232","url":null,"abstract":"<p><p>Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker
Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work exploring ML suppression (silencing predictions based on auditing the ML) shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a criterion for auditing the ML.
Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
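For reference, one common formulation of the fairness metric named above, the absolute averaged odds difference, is the average of the absolute true-positive-rate gap and the absolute false-positive-rate gap between two subpopulations. The sketch below follows that formulation and is not necessarily the authors' exact implementation.

# One common formulation of absolute averaged odds difference (assumption noted above).
import numpy as np

def _rates(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == 1] == 1)  # true positive rate
    fpr = np.mean(y_pred[y_true == 0] == 1)  # false positive rate
    return tpr, fpr

def abs_averaged_odds_difference(y_true, y_pred, group):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr_a, fpr_a = _rates(y_true[group == 0], y_pred[group == 0])
    tpr_b, fpr_b = _rates(y_true[group == 1], y_pred[group == 1])
    return 0.5 * (abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b))

# Toy example: a value of 0 would indicate equalized odds across the two groups.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(abs_averaged_odds_difference(y_true, y_pred, group))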
Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.
Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.
{"title":"Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.","authors":"Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker","doi":"10.1093/jamia/ocaf235","DOIUrl":"https://doi.org/10.1093/jamia/ocaf235","url":null,"abstract":"<p><strong>Objective: </strong>Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML has been known to have unfairness-inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness by overreliance. Recent work exploring ML suppression-silencing predictions based on auditing the ML-shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and evaluate ML uncertainty as desiderata to audit the ML.</p><p><strong>Materials and methods: </strong>We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.</p><p><strong>Results: </strong>When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10-6) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10-4) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.</p><p><strong>Conclusion: </strong>Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.
Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.
Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.
Results: Studies often addressed multiple early-stage clinical tasks, namely question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Most studies (79.5%) relied on quantitative evaluation alone, though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.
Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.
{"title":"Structural insights into clinical large language models and their barriers to translational readiness.","authors":"Jiwon You, Hangsik Shin","doi":"10.1093/jamia/ocaf230","DOIUrl":"https://doi.org/10.1093/jamia/ocaf230","url":null,"abstract":"<p><strong>Background: </strong>Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.</p><p><strong>Objective: </strong>We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.</p><p><strong>Methods: </strong>We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.</p><p><strong>Results: </strong>Studies often addressed multiple early stage clinical tasks-question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%)-primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.</p><p><strong>Conclusion: </strong>Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145949378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu
Objectives: To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).
Materials and methods: We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources: UT Physicians (UTP; 1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples; 146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2; 50). The corpus captures 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). LLaMA-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.
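The abstract does not give the instruction-tuning prompt format, so the sketch below is only a hypothetical example of how one NER training instance might be serialized; the note text, entity labels, and output layout are invented for illustration.

# Hypothetical instruction-tuning example for clinical NER (format assumed, not from the paper).
import json

note = "Patient denies chest pain. Started metformin 500 mg for type 2 diabetes."

example = {
    "instruction": (
        "Extract all clinical entities (problem, test, medication, other treatment) "
        "and their modifiers (eg, negation, certainty) from the note. "
        "Return one entity per line as: entity_text | type | modifiers."
    ),
    "input": note,
    "output": "chest pain | problem | negated\n"
              "metformin 500 mg | medication |\n"
              "type 2 diabetes | problem |",
}

print(json.dumps(example, indent=2))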
Results: LLaMA models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, performance gains came with increased computational costs, with LLaMA models requiring more memory and graphics processing unit (GPU) hours and running up to 28 times slower than BERT.
Discussion: While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.
Conclusion: Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.
{"title":"Information extraction from clinical notes: are we ready to switch to large language models?","authors":"Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu","doi":"10.1093/jamia/ocaf213","DOIUrl":"https://doi.org/10.1093/jamia/ocaf213","url":null,"abstract":"<p><strong>Objectives: </strong>To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).</p><p><strong>Materials and methods: </strong>We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources-UT Physicians (UTP) (1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples) (146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2) (50), capturing 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). Large Language Model Meta AI-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.</p><p><strong>Results: </strong>Large Language Model Meta AI models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, performance gains came with increased computational costs, with LLaMA models requiring more memory and Graphics Processing Unit (GPU) hours and running up to 28 times slower than BERT.</p><p><strong>Discussion: </strong>While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.</p><p><strong>Conclusion: </strong>Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145985179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto
Background: Chatbots are increasingly used to deliver health education, support patient engagement, and expand access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.
Materials and methods: GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.
Results: GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.
Discussion: Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.
Conclusion: GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.
{"title":"GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots.","authors":"Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto","doi":"10.1093/jamia/ocaf211","DOIUrl":"10.1093/jamia/ocaf211","url":null,"abstract":"<p><strong>Background: </strong>Chatbots are increasingly used to deliver health education, patient engagement, and access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.</p><p><strong>Materials and methods: </strong>GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.</p><p><strong>Results: </strong>GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.</p><p><strong>Discussion: </strong>Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.</p><p><strong>Conclusion: </strong>GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798686/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145953525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding
Objective: Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.
Materials and methods: SDoH-GPT is a framework that combines few-shot learning LLM methods, which extract SDoH from medical notes, with XGBoost classifiers trained on the annotations those LLM methods generate. This combination leverages the strength of LLMs as few-shot learners and the efficiency of XGBoost once the training dataset is sufficient. As a result, SDoH-GPT can extract SDoH without relying on extensive medical annotations or costly human intervention.
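A minimal sketch of this two-stage idea appears below: notes labeled by a few-shot LLM serve as training data for an XGBoost classifier. The feature representation (TF-IDF), column names, and file name are assumptions for illustration, not details taken from the paper.

# Two-stage sketch under the assumptions stated above; not the SDoH-GPT codebase.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Hypothetical file with note text and a binary SDoH label produced by a few-shot LLM.
df = pd.read_csv("llm_labeled_notes.csv")
X_text, y = df["note_text"], df["llm_label"]

X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=0)

vec = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")

clf.fit(vec.fit_transform(X_train), y_train)
probs = clf.predict_proba(vec.transform(X_test))[:, 1]
print("AUROC against held-out LLM labels:", roc_auc_score(y_test, probs))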
Results: Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of LLM and XGBoost can ensure high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores.
Discussion: This study verified SDoH-GPT on three datasets and highlights the potential of leveraging LLMs and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.
Conclusion: The key contribution of this study is the integration of LLMs with XGBoost, which enables cost-effective, high-quality annotation of SDoH. This research sets the stage for SDoH extraction to become more accessible, scalable, and impactful in driving future healthcare solutions.
{"title":"SDoH-GPT: using large language models to extract social determinants of health.","authors":"Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding","doi":"10.1093/jamia/ocaf094","DOIUrl":"10.1093/jamia/ocaf094","url":null,"abstract":"<p><strong>Objective: </strong>Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.</p><p><strong>Materials and methods: </strong>SDoH-GPT is a framework including the few-shot learning LLM methods to extract the SDoH from medical notes and the XGBoost classifiers which continue to classify SDoH using the annotations generated by the few-shot learning LLM methods as training datasets. The unique combination of the few-shot learning LLM methods with XGBoost utilizes the strength of LLMs as great few shot learners and the efficiency of XGBoost when the training dataset is sufficient. Therefore, SDoH-GPT can extract SDoH without relying on extensive medical annotations or costly human intervention.</p><p><strong>Results: </strong>Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of LLM and XGBoost can ensure high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores.</p><p><strong>Discussion: </strong>This study has verified SDoH-GPT on three datasets and highlights the potential of leveraging LLM and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.</p><p><strong>Conclusion: </strong>The key contribution of this study is the integration of LLM with XGBoost, which enables cost-effective and high quality annotations of SDoH. This research sets the stage for SDoH can be more accessible, scalable, and impactful in driving future healthcare solutions.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"67-78"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758468/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144267837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury
Objective: Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.
Materials and methods: We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each recording, the linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using intervals of different lengths to determine PVC frequency.
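The interval-based correlation described above can be sketched as follows: bin beats into fixed-length windows, compute the PVC fraction and mean heart rate per window, and take the Pearson correlation across windows. This is an illustrative sketch, not the study code; the input columns (beat_time_s, is_pvc, rr_s) are hypothetical placeholders for a Holter beat annotation table.

# Illustrative sketch only (column names assumed; not the study's implementation).
import pandas as pd
from scipy.stats import pearsonr

def pvc_hr_correlation(beats: pd.DataFrame, interval_s: float = 3600.0):
    bins = (beats["beat_time_s"] // interval_s).astype(int)
    per_bin = beats.groupby(bins).agg(
        pvc_freq=("is_pvc", "mean"),                    # fraction of beats that are PVCs
        mean_hr=("rr_s", lambda rr: 60.0 / rr.mean()),  # mean heart rate in bpm
    )
    return pearsonr(per_bin["pvc_freq"], per_bin["mean_hr"])  # (r, P-value)

# Calling this again with, eg, interval_s=300 probes the shorter windows discussed
# in the Results, where the linear relationship often breaks down.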
Results: Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.
Discussion: The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.
Conclusion: Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.
{"title":"Dependence of premature ventricular complexes on heart rate-it's not that simple.","authors":"Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury","doi":"10.1093/jamia/ocaf069","DOIUrl":"10.1093/jamia/ocaf069","url":null,"abstract":"<p><strong>Objective: </strong>Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.</p><p><strong>Materials and methods: </strong>We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using different length intervals to determine PVC frequency.</p><p><strong>Results: </strong>Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.</p><p><strong>Discussion: </strong>The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.</p><p><strong>Conclusion: </strong>Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"90-97"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758478/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144055982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}