
Latest articles from the Journal of the American Medical Informatics Association

Testing and evaluation of generative large language models in electronic health record applications: a systematic review.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf233
Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou

Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.

Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.

Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.

Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.

Citations: 0
Digital interdependence: impact of work spillover during clinical team handoffs.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf212
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson

Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.

Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.
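
The regression setup described above can be sketched as follows. This is an illustrative simulation, not the authors' code: variable names, effect sizes, and sample sizes are all invented, and the model is a `statsmodels` mixed effects regression with a random intercept per daytime physician.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated patient-day data (all names and effect sizes are invented).
rng = np.random.default_rng(0)
n_docs, n_days = 30, 40
df = pd.DataFrame({
    "day_physician": np.repeat(np.arange(n_docs), n_days),
    "dayshift_ehr_min": rng.normal(60, 15, n_docs * n_days),
})
doc_effect = rng.normal(0, 5, n_docs)[df["day_physician"]]
df["nightshift_ehr_min"] = (
    20 + 0.3 * df["dayshift_ehr_min"] + doc_effect + rng.normal(0, 8, len(df))
)

# Mixed effects model: nightshift EHR time as a function of the preceding
# dayshift team's EHR time, with daytime-physician random intercepts.
fit = smf.mixedlm("nightshift_ehr_min ~ dayshift_ehr_min",
                  df, groups=df["day_physician"]).fit()
print(fit.params["dayshift_ehr_min"])  # recovers a slope near the simulated 0.3
```

In the real analysis the unit of observation is the patient-day and the outcome set also includes handoff time and overnight deterioration; the sketch only shows the random-effects structure.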

Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).

Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.

Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.

Citations: 0
Measuring the accuracy of electronic health record-based phenotyping in the All of Us Research Program to optimize statistical power for genetic association testing.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf234
John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah

Objective: Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate the accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.

Materials and methods: We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.
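
As a hedged illustration of this kind of calculation (not the authors' code; the frequencies, sample sizes, and PPV value below are invented), non-differential case misclassification can be folded into a standard two-proportion power approximation by treating the labeled case group as a PPV-weighted mixture of true cases and controls:

```python
from scipy.stats import norm

def power_two_prop(p_case, p_ctrl, ppv, n_case, n_ctrl, alpha=5e-8):
    """Approximate power for comparing carrier frequencies when only a
    fraction `ppv` of labeled cases are true cases (normal approximation)."""
    # Observed carrier frequency among labeled "cases" is attenuated toward
    # the control frequency by misclassified controls.
    p_obs = ppv * p_case + (1 - ppv) * p_ctrl
    se = (p_obs * (1 - p_obs) / n_case + p_ctrl * (1 - p_ctrl) / n_ctrl) ** 0.5
    z = abs(p_obs - p_ctrl) / se
    return norm.sf(norm.isf(alpha / 2) - z)

perfect = power_two_prop(0.05, 0.01, ppv=1.0, n_case=2000, n_ctrl=20000)
noisy = power_two_prop(0.05, 0.01, ppv=0.7, n_case=2000, n_ctrl=20000)
print(round(perfect, 3), round(noisy, 3))  # misclassification attenuates power
```

The attenuation grows as the target allele becomes rarer or less penetrant, which is consistent with the pattern the abstract reports.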

Results: We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.

Discussion: EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.

Citations: 0
Digital health literacy as mediator between language preference and telehealth use among Latinos in the United States.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf232
Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas

Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.

Citations: 0
Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf235
Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker

Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness: inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work exploring ML suppression (silencing predictions based on an audit of the ML) shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a desideratum for auditing the ML.

Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
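
A toy, numpy-only sketch of the suppression idea (not the authors' pipeline): an auditor silences ML scores whose predictive entropy is high and falls back to a simulated human decision, and AUROC is compared before and after. All scores, slice sizes, and the entropy threshold are invented.

```python
import numpy as np

def auroc(y, s):
    # Mann-Whitney formulation of the area under the ROC curve.
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n1 = y.sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / ((len(y) - n1) * n1)

rng = np.random.default_rng(2)
n = 4000
y = rng.integers(0, 2, n)
ml = np.clip(y * 0.7 + 0.15 + rng.normal(0, 0.15, n), 0, 1)
bad = rng.random(n) < 0.25
ml[bad] = rng.random(bad.sum())  # ML is uninformative on this slice
human = np.clip(y * 0.5 + 0.25 + rng.normal(0, 0.2, n), 0, 1)

# Auditor: suppress predictions whose binary entropy is high (an uncertainty
# criterion, as in the abstract) and defer to the human on those cases.
entropy = -(ml * np.log(ml + 1e-9) + (1 - ml) * np.log(1 - ml + 1e-9))
blended = np.where(entropy > 0.6, human, ml)
print(auroc(y, ml), auroc(y, blended))
```

In this simulation the blended score outperforms the raw ML because the suppressed cases are concentrated in the uninformative slice; the study's auditor models play the same gating role over real clinician decisions.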

Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.

Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.

Citations: 0
Structural insights into clinical large language models and their barriers to translational readiness.
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-11 DOI: 10.1093/jamia/ocaf230
Jiwon You, Hangsik Shin

Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.

Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.

Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.

Results: Studies often addressed multiple early-stage clinical tasks: question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.

Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.

Citations: 0
Information extraction from clinical notes: are we ready to switch to large language models?
IF 4.6 CAS Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-10 DOI: 10.1093/jamia/ocaf213
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu

Objectives: To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).

Materials and methods: We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources: UT Physicians (UTP; 1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples; 146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2; 50), capturing 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). LLaMA-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.
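
For context on how NER benchmarks of this kind are scored, clinical NER is typically evaluated with strict entity-level F1 over (start, end, type) spans: a predicted entity counts only if both its boundaries and its type match a gold annotation exactly. A minimal sketch with invented spans (not the authors' evaluation code):

```python
def entity_f1(gold, pred):
    """Strict entity-level F1 over (start, end, type) tuples: a prediction
    scores only if its span boundaries and entity type match gold exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Invented example: two of three predicted spans match the gold standard,
# so precision = recall = 2/3 and F1 = 2/3.
gold = {(0, 12, "PROBLEM"), (20, 29, "MEDICATION"), (35, 41, "TEST")}
pred = {(0, 12, "PROBLEM"), (20, 29, "MEDICATION"), (50, 55, "TEST")}
print(round(entity_f1(gold, pred), 3))  # 0.667
```

Relation extraction is scored analogously over (entity, entity, relation-type) triples, which is why small span errors propagate into the RE numbers.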

Results: LLaMA models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, these performance gains came with increased computational costs: LLaMA models required more memory and graphics processing unit (GPU) hours and ran up to 28 times slower than BERT.
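The F1 scores reported here follow the usual exact-span convention for NER evaluation; a minimal sketch of that metric (the span offsets and labels below are illustrative, not drawn from the study corpus):

```python
def entity_f1(gold, pred):
    """Micro precision/recall/F1 over exact-match (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: spans agreeing on offsets and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical spans for one note (offsets and labels are made up):
gold = [(0, 12, "problem"), (20, 31, "medication"), (40, 47, "test")]
pred = [(0, 12, "problem"), (20, 31, "medication"), (50, 55, "test")]
p, r, f = entity_f1(gold, pred)  # two of three spans match exactly, so p = r = f = 2/3
```

Exact-span matching counts a prediction as correct only when start, end, and label all agree; relation extraction is typically scored the same way over (head, tail, relation) triples.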

Discussion: While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.

Conclusion: Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.

GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots.
IF 4.6 Medicine (CAS Tier 2) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-10 DOI: 10.1093/jamia/ocaf211
Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto

Background: Chatbots are increasingly used to deliver health education, patient engagement, and access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.

Materials and methods: GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.

Results: GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.
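The scripted and hybrid modes in feature (2) can be pictured as a state machine that defers unmatched free-text turns to an LLM. The sketch below is a hypothetical illustration of that design, not the actual GARDE-Chat API; `ScriptNode`, `step`, and `llm_fallback` are invented names:

```python
from dataclasses import dataclass, field


@dataclass
class ScriptNode:
    prompt: str
    # Maps a normalized user reply to the id of the next node in the script.
    transitions: dict = field(default_factory=dict)


# A toy script loosely modeled on the genetic-testing use case mentioned above.
SCRIPT = {
    "start": ScriptNode("Are you interested in genetic testing? (yes/no)",
                        {"yes": "schedule", "no": "end"}),
    "schedule": ScriptNode("Would you prefer a phone or video visit?",
                           {"phone": "end", "video": "end"}),
    "end": ScriptNode("Thank you! A coordinator will follow up."),
}


def step(node_id, user_reply, llm_fallback=None):
    """Advance the conversation one turn; defer to an LLM when the script has no match."""
    node = SCRIPT[node_id]
    nxt = node.transitions.get(user_reply.strip().lower())
    if nxt is not None:                    # scripted path: reply matched a transition
        return nxt, SCRIPT[nxt].prompt
    if llm_fallback is not None:           # hybrid path: hand free text to the LLM
        return node_id, llm_fallback(user_reply)
    return node_id, node.prompt            # purely scripted path: re-prompt
```

Keeping the script as data rather than code is what lets a visual authoring interface (feature 1) generate it without programming.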

Discussion: Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.

Conclusion: GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.

SDoH-GPT: using large language models to extract social determinants of health.
IF 4.6 Medicine (CAS Tier 2) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-01 DOI: 10.1093/jamia/ocaf094
Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

Objective: Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.

Materials and methods: SDoH-GPT is a framework that combines few-shot learning LLM methods, which extract SDoH from medical notes, with XGBoost classifiers trained on the annotations those LLM methods generate. This combination exploits the strength of LLMs as few-shot learners and the efficiency of XGBoost once the training dataset is sufficient. SDoH-GPT can therefore extract SDoH without relying on extensive medical annotations or costly human intervention.

Results: Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and strong consistency with human annotators, with Cohen's kappa values of up to 0.92. The innovative combination of LLM and XGBoost ensures high accuracy and computational efficiency while consistently maintaining AUROC scores above 0.90.
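Cohen's kappa, used above to quantify agreement with human annotators, corrects observed agreement for the agreement expected by chance; a minimal stdlib sketch (the SDoH labels are illustrative, not from the study's datasets):

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


# Hypothetical SDoH labels from a human annotator and an LLM for six notes:
ann_human = ["housing", "none", "housing", "employment", "none", "none"]
ann_llm = ["housing", "none", "employment", "employment", "none", "none"]
kappa = cohens_kappa(ann_human, ann_llm)  # 17/23, about 0.74
```

Values near 1 indicate agreement well beyond chance; 0 means no better than chance, which is why kappa is preferred over raw percent agreement when label distributions are skewed.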

Discussion: This study has verified SDoH-GPT on three datasets and highlights the potential of leveraging LLM and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.

Conclusion: The key contribution of this study is the integration of LLMs with XGBoost, which enables cost-effective, high-quality annotation of SDoH. This research sets the stage for SDoH extraction to become more accessible, scalable, and impactful in driving future healthcare solutions.

Dependence of premature ventricular complexes on heart rate-it's not that simple.
IF 4.6 Medicine (CAS Tier 2) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-01 DOI: 10.1093/jamia/ocaf069
Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury

Objective: Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.

Materials and methods: We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, the linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using intervals of different lengths to determine PVC frequency.
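The per-record computation described here (correlating PVC counts with mean heart rate over fixed-length intervals) reduces to a Pearson correlation per window; a minimal stdlib sketch with illustrative hourly values, not study data:

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between per-interval PVC counts and mean heart rates."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Illustrative hourly bins from one recording (invented values, not study data):
heart_rate = [62, 65, 70, 74, 80, 85, 90, 88]  # mean HR per interval (bpm)
pvc_count = [40, 42, 50, 55, 64, 70, 78, 74]   # PVCs per interval
r = pearson_r(heart_rate, pvc_count)           # strongly positive here
```

Changing the interval length changes the (x, y) pairs entering the correlation, which is one reason the sign of r can flip between analyses of the same recording.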

Results: Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.

Discussion: The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.

Conclusion: Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.

Journal of the American Medical Informatics Association