Objectives: Atrial fibrillation (AF) is common among intensive care unit (ICU) patients. Effective management of AF in this setting remains a subject of debate, with current guidelines often derived from outpatient studies. This study aims to evaluate the effectiveness of different AF management strategies (combined rhythm and rate control, rhythm control only, rate control only, or no control) in reducing mortality in ICU patients using a deep learning-based causal inference model.
Materials and methods: Data from the Medical Information Mart for Intensive Care (MIMIC)-III and MIMIC-IV were utilized, encompassing ICU admissions with documented AF. Exposures were combined rhythm and rate control, rhythm control only, rate control only, or no control. A deep learning-based causal inference model was used to estimate treatment effects. Additionally, the characteristics of patients who benefited more from rhythm control than from rate control were identified using treatment effect sizes and multivariable logistic regression.
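The abstract does not specify the model architecture, so as a rough illustration of this class of method, the sketch below estimates average treatment effects with a T-learner: one neural outcome model per treatment arm, with the ATE taken as the mean difference in predicted mortality. All names and settings are assumptions, not the authors' implementation.

```python
# Hedged sketch: T-learner-style multi-arm ATE estimation for AF strategies.
# The paper's actual deep learning causal model is not described in the
# abstract; arm labels, architecture, and hyperparameters are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

STRATEGIES = ["none", "rate_only", "rhythm_only", "rate_and_rhythm"]

def fit_outcome_models(X, treatment, died):
    """Fit one mortality model P(death | covariates) per treatment arm."""
    models = {}
    for arm in STRATEGIES:
        mask = treatment == arm
        m = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
        m.fit(X[mask], died[mask])
        models[arm] = m
    return models

def average_treatment_effect(models, X, arm, baseline="none"):
    """ATE of `arm` vs `baseline`: mean difference in predicted mortality."""
    p_arm = models[arm].predict_proba(X)[:, 1]
    p_base = models[baseline].predict_proba(X)[:, 1]
    return float(np.mean(p_arm - p_base))  # negative => mortality reduction
```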
Results: The study population comprised 13 583 patients. Combined rhythm and rate control, rhythm control-only, and rate control-only strategies all significantly reduced in-hospital mortality compared to no control, with average treatment effects of -1.23% (-1.43% to -1.03%), -2.32% (-2.48% to -2.15%), and -9.11% (-9.29% to -8.93%), respectively. Rhythm control proved more effective than rate control in specific subgroups: older age, higher maximum heart rate, new-onset AF, absence of hypertension, absence of diabetes, presence of chronic liver disease, no prior heart surgery, and use of vasopressor agents.
Conclusion: Using a deep learning-based causal inference model, we quantified mortality reduction for each treatment strategy and identified the patient characteristics associated with the most favorable outcomes for each strategy.
{"title":"Determining optimal strategies for personalized atrial fibrillation treatment in intensive care unit patients using a deep learning-based causal inference approach: rhythm and/or rate control.","authors":"Min Woo Kang, Shin Young Ahn, Yoonjin Kang","doi":"10.1093/jamia/ocaf203","DOIUrl":"10.1093/jamia/ocaf203","url":null,"abstract":"<p><strong>Objectives: </strong>Atrial fibrillation (AF) is common among intensive care unit (ICU) patients. Effective management of AF in this setting remains a subject of debate, with current guidelines often derived from outpatient studies. This study aims to evaluate the effectiveness of different AF management strategies-both, rhythm, rate, or no control-in reducing mortality in ICU patients using a deep learning-based causal inference model.</p><p><strong>Materials and methods: </strong>Data from the Medical Information Mart for Intensive Care (MIMIC)-III and MIMIC-IV were utilized, encompassing ICU admissions with documented AF. Exposures included both rhythm and rate, only rhythm, and only rate, or no control. A deep learning-based causal inference model analyzed treatment effects. Additionally, the characteristics of patients who benefited more from rhythm control compared to rate control were identified using treatment effect sizes and multivariable logistic regression.</p><p><strong>Results: </strong>The study population comprised 13 583 patients. Both rhythm and rate control, rhythm control-only, and rate control-only strategies significantly reduced in-hospital mortality compared to no control, with average treatment effects of -1.23% (-1.43% to -1.03%), -2.32% (-2.48% to -2.15%), and -9.11% (-9.29% to -8.93%), respectively. Rhythm control proved more effective than rate control in specific subgroups: older age, higher maximum heart rate, presence of new-onset AF, absence of hypertension, absence of diabetes, chronic liver disease, not having undergone heart surgery, and the use of vasopressor agents.</p><p><strong>Conclusion: </strong>Using a deep learning-based causal inference model, we quantified mortality reduction for each treatment strategy and identified the patient characteristics associated with the most favorable outcomes for each strategy.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"679-689"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981632/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145642141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuang Wang, Yang Zhang, Ying Gao, Xin He, Guanghui Deng, Jian Du
Objectives: To develop and evaluate a knowledge graph-augmented large language model (LLM) framework that synthesizes epidemiological evidence to infer life-course exposure-outcome pathways, using gestational diabetes mellitus (GDM) and dementia as a case study.
Materials and methods: We constructed a causal knowledge graph by extracting empirical epidemiological associations from scientific literature, excluding hypothetical assertions. The graph was integrated with GPT-4 through four graph retrieval-augmented generation (GRAG) strategies to infer bridging variables between early-life exposure (GDM) and later-life outcome (dementia). Semantic triples served as structured inputs to support LLM reasoning. Each GRAG strategy was evaluated by human clinical experts and three LLM-based reviewers (GPT-4o, Llama 3-70B, and Gemini Advanced), assessing scientific reliability, novelty, and clinical relevance.
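As a hedged sketch of the graph retrieval step described here, the code below collects knowledge-graph triples lying on paths between the exposure and the outcome and serializes them into a prompt. The graph schema, relation labels, and prompt wording are illustrative assumptions; the paper's actual GRAG strategies differ in which abstracts and sub-communities they draw on.

```python
# Hedged sketch: retrieve exposure-to-outcome bridging triples from a causal
# knowledge graph and build an LLM prompt from them. Edges here are toy
# examples, not the paper's extracted epidemiological associations.
import networkx as nx

G = nx.DiGraph()
G.add_edge("gestational diabetes", "chronic kidney disease",
           relation="increases risk of")
G.add_edge("chronic kidney disease", "dementia",
           relation="increases risk of")

def bridging_triples(graph, exposure, outcome, cutoff=3):
    """Collect (subject, relation, object) triples along simple paths."""
    triples = []
    for path in nx.all_simple_paths(graph, exposure, outcome, cutoff=cutoff):
        for s, o in zip(path, path[1:]):
            triples.append((s, graph[s][o]["relation"], o))
    return triples

def build_prompt(triples, exposure, outcome):
    facts = "\n".join(f"- {s} {r} {o}." for s, r, o in triples)
    return (f"Using only the epidemiological facts below, explain plausible "
            f"mediating variables linking {exposure} to {outcome}.\n{facts}")

print(build_prompt(bridging_triples(G, "gestational diabetes", "dementia"),
                   "gestational diabetes", "dementia"))
```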
Results: The GRAG strategy using a minimal set of abstracts specifically related to GDM-dementia bridging variables performed comparably to the strategy using broader sub-community abstracts, and both significantly outperformed approaches using the full GDM- or dementia-related corpus or baseline GPT-4 without external augmentation. The knowledge graph-augmented LLM identified 108 maternal candidate mediators, including validated risk factors such as chronic kidney disease and physical inactivity. The structured approach improved accuracy and reduced confabulation compared to standard LLM outputs.
Discussion: Our findings suggest that augmenting LLMs with epidemiological knowledge graphs enables effective reasoning over fragmented literature and supports the reconstruction of progressive risk pathways. Expert assessments revealed that LLMs may overestimate clinical relevance, highlighting the need for human-AI collaboration in interpretation and application.
Conclusion: Integrating semantic epidemiological knowledge with LLMs via GRAG strategies provides a promising framework for life-course epidemiology, enabling early detection of modifiable risk factors and guiding variable selection in cohort study design.
{"title":"Knowledge graph-augmented large language models for reconstructing life course risk pathways: a gestational diabetes mellitus-to-dementia case study.","authors":"Shuang Wang, Yang Zhang, Ying Gao, Xin He, Guanghui Deng, Jian Du","doi":"10.1093/jamia/ocaf219","DOIUrl":"10.1093/jamia/ocaf219","url":null,"abstract":"<p><strong>Objectives: </strong>To develop and evaluate a knowledge graph-augmented large language model (LLM) framework that synthesizes epidemiological evidence to infer life-course exposure-outcome pathways, using gestational diabetes mellitus (GDM) and dementia as a case study.</p><p><strong>Materials and methods: </strong>We constructed a causal knowledge graph by extracting empirical epidemiological associations from scientific literature, excluding hypothetical assertions. The graph was integrated with GPT-4 through four graph retrieval-augmented generation (GRAG) strategies to infer bridging variables between early-life exposure (GDM) and later-life outcome (dementia). Semantic triples served as structured inputs to support LLM reasoning. Each GRAG strategy was evaluated by human clinical experts and three LLM-based reviewers (GPT-4o, Llama 3-70B, and Gemini Advanced), assessing scientific reliability, novelty, and clinical relevance.</p><p><strong>Results: </strong>The GRAG strategy using a minimal set of abstracts specifically related to GDM-dementia bridging variables performed comparably to the strategy using broader sub-community abstracts, and both significantly outperformed approaches using the full GDM- or dementia-related corpus or baseline GPT-4 without external augmentation. The knowledge graph-augmented LLM identified 108 maternal candidate mediators, including validated risk factors such as chronic kidney disease and physical inactivity. The structured approach improved accuracy and reduced confabulation compared to standard LLM outputs.</p><p><strong>Discussion: </strong>Our findings suggest that augmenting LLMs with epidemiological knowledge graphs enables effective reasoning over fragmented literature and supports the reconstruction of progressive risk pathways. Expert assessments revealed that LLMs may overestimate clinical relevance, highlighting the need for human-AI collaboration in interpretation and application.</p><p><strong>Conclusion: </strong>Integrating semantic epidemiological knowledge with LLMs via GRAG strategies provides a promising framework for life-course epidemiology, enabling early detection of modifiable risk factors and guiding variable selection in cohort study design.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"632-640"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981626/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145776344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.
Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified the clinical specialties and tasks addressed in each included article and summarized the evaluation methods.
Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarization and patient communication made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
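The reported metric-to-human comparison presumably rests on correlating automated scores with expert ratings; a minimal sketch of that kind of check, with invented scores, follows.

```python
# Hedged sketch: rank correlation between an automated NLP metric and human
# ratings. All scores are invented for illustration; the review's actual
# correlation analysis is not detailed in the abstract.
from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 1, 4, 2]                        # expert ratings
metric_scores = [0.71, 0.74, 0.62, 0.69, 0.55, 0.70, 0.66]  # eg, ROUGE-L

rho, p = spearmanr(human_scores, metric_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")  # weak rho => poor proxy
```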
Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
{"title":"Testing and evaluation of generative large language models in electronic health record applications: a systematic review.","authors":"Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou","doi":"10.1093/jamia/ocaf233","DOIUrl":"10.1093/jamia/ocaf233","url":null,"abstract":"<p><strong>Background: </strong>The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.</p><p><strong>Methods: </strong>We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.</p><p><strong>Results: </strong>Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.</p><p><strong>Conclusion: </strong>Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"743-753"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981627/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmed Farrag, Ahmed Soliman, Elham Hatef, Amie Goodin, Masoud Rouhizadeh
Objective: This scoping review aimed to (1) map current applications of transformers and large language models (LLMs) for extracting social drivers of health (SDOH) from clinical text, (2) benchmark model performance across SDOH domains, and (3) evaluate methodological rigor to identify research gaps and inform clinical deployment.
Materials and methods: We searched PubMed, Web of Science, Embase, Scopus, and IEEE Xplore for studies applying transformers or LLMs to detect SDOH in clinical narratives. We developed a novel methodological framework integrating (1) hierarchical classification of SDOH domains and transformer/LLM architectures, (2) systematic synthesis of performance metrics, and (3) a 7-domain instrument assessing internal validity, external validity, and reporting transparency.
Results: Forty-two studies met inclusion criteria. Performance varied substantially across SDOH domains. Behavioral Factors achieved the highest median F1-score (0.87), while Health Care Access and Quality showed the lowest performance and greatest variability (median F1 = 0.59). Research concentrated in the United States (85.7%), relied predominantly on private institutional datasets (69%), and focused primarily on critical care populations (45.2%). Methodological assessment revealed critical gaps; only 29% of studies provided annotation guidelines, 24% assessed fairness across demographic groups, and 21% performed external validation.
Discussion: Smaller open-source transformer models show promise for democratizing SDOH detection by achieving competitive performance at lower costs while enabling secure local deployment in resource-limited settings. Advancing clinical readiness requires standardized reporting practices, diverse benchmark datasets across care settings, and systematic equity evaluation to prevent perpetuating health disparities.
Conclusion: Transformer and LLM performance for SDOH detection varied substantially across domains, with encoder-based models excelling at structured tasks and decoder-only models at linguistically complex tasks. Critical gaps in fairness assessment, external validation, and dataset diversity restrict generalizability and readiness for widespread clinical deployment.
{"title":"Beyond metrics to methods: a scoping review of transformers and large language models for detection of social drivers of health in clinical notes.","authors":"Ahmed Farrag, Ahmed Soliman, Elham Hatef, Amie Goodin, Masoud Rouhizadeh","doi":"10.1093/jamia/ocaf201","DOIUrl":"10.1093/jamia/ocaf201","url":null,"abstract":"<p><strong>Objective: </strong>This scoping review aimed to (1) map current applications of transformers and large language models (LLMs) for extracting social drivers of health (SDOH) from clinical text, (2) benchmark model performance across SDOH domains, and (3) evaluate methodological rigor to identify research gaps and inform clinical deployment.</p><p><strong>Materials and methods: </strong>We searched PubMed, Web of Science, Embase, Scopus, and IEEE Xplore for studies applying transformers or LLMs to detect SDOH in clinical narratives. We developed a novel methodological framework integrating (1) hierarchical classification of SDOH domains and transformer/LLM architectures, (2) systematic synthesis of performance metrics, and (3) a 7-domain instrument assessing internal validity, external validity, and reporting transparency.</p><p><strong>Results: </strong>Forty-two studies met inclusion criteria. Performance varied substantially across SDOH domains. Behavioral Factors achieved the highest median F1-score (0.87), while Health Care Access and Quality showed the lowest performance and greatest variability (median F1 = 0.59). Research concentrated in the United States (85.7%), relied predominantly on private institutional datasets (69%), and focused primarily on critical care populations (45.2%). Methodological assessment revealed critical gaps; only 29% of studies provided annotation guidelines, 24% assessed fairness across demographic groups, and 21% performed external validation.</p><p><strong>Discussion: </strong>Smaller open-source transformer models show promise for democratizing SDOH detection by achieving competitive performance at lower costs while enabling secure local deployment in resource-limited settings. Advancing clinical readiness requires standardized reporting practices, diverse benchmark datasets across care settings, and systematic equity evaluation to prevent perpetuating health disparities.</p><p><strong>Conclusion: </strong>Transformer and LLM performance for SDOH detection varied substantially across domains, with encoder-based models excelling at structured tasks and decoder-only models at linguistically complex tasks. Critical gaps in fairness assessment, external validation, and dataset diversity restrict generalizability and readiness for widespread clinical deployment.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"754-769"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981650/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146203657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun
Objectives: Large language models' (LLMs') performance in high-stakes, compliance-driven settings such as drafting clinical research documents remains underexplored. This study aims to build a benchmark and an evaluation framework for assessing LLMs' compliance and factuality in generating informed consent forms (ICFs) from clinical trial protocols.
Materials and methods: We introduce InformBench, a benchmark comprising 900 clinical trial documents, and propose an evaluation framework grounded in regulatory guidelines and site-specific consent templates. We assess LLM performance on transforming trial protocols, often hundreds of pages, into concise, patient-facing ICFs. Additionally, we design InformGen, a retrieval-augmented, human-in-the-loop pipeline aimed at improving generation quality.
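The abstract does not detail how compliance is scored, but one plausible building block is a checklist comparison of a generated ICF draft against required consent elements. The sketch below uses naive keyword matching over informally stated Common Rule elements purely for illustration; a real pipeline would need semantic matching plus the human-in-the-loop review described above.

```python
# Hedged sketch: flag required informed-consent elements missing from a
# generated ICF draft. Element names loosely follow the US Common Rule;
# the keyword-matching rule is an illustrative assumption.
REQUIRED_ELEMENTS = {
    "purpose": ["purpose of the research"],
    "risks": ["risks", "discomforts"],
    "benefits": ["benefits"],
    "voluntary": ["voluntary", "may withdraw"],
    "confidentiality": ["confidentiality"],
}

def compliance_report(icf_text: str) -> dict:
    """Return {element: present?} for each required consent element."""
    text = icf_text.lower()
    return {element: any(kw in text for kw in keywords)
            for element, keywords in REQUIRED_ELEMENTS.items()}

draft = "Participation is voluntary and you may withdraw at any time."
print(compliance_report(draft))  # missing elements go to human review
```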
Results: Baseline LLMs such as GPT-4o achieved only 70%-80% compliance and exhibited factual errors in 18%-43% of cases. In contrast, InformGen substantially improved outputs, achieving nearly 100% regulatory compliance and over 90% factual accuracy, as validated by 5 domain-expert annotators.
Discussion: The study reveals critical limitations in current LLMs for clinical research document drafting, particularly in regulatory sensitivity and factual grounding. Our results highlight the need for domain-specific benchmarks and structured evaluations to support safe deployment in real-world clinical research workflows.
Conclusion: LLMs offer value in clinical research document generation but must be adapted and rigorously evaluated for high-stakes applications. Our benchmark and framework provide a foundation for improving and assessing LLM-generated outputs in compliance-critical domains.
{"title":"Compliance and factuality of large language models for clinical research document generation.","authors":"Zifeng Wang, Junyi Gao, Benjamin Danek, Brandon Theodorou, Ruba Shaik, Shivashankar Thati, Seunghyun Won, Jimeng Sun","doi":"10.1093/jamia/ocaf174","DOIUrl":"10.1093/jamia/ocaf174","url":null,"abstract":"<p><strong>Objectives: </strong>Large language models' (LLMs') performance in high-stakes, compliance-driven settings such as drafting clinical research documents remains underexplored. This study aims to build a benchmark and an evaluation framework for assessing LLMs' compliance and factuality in generating informed consent forms (ICFs) from clinical trial protocols.</p><p><strong>Materials and methods: </strong>We introduce InformBench, a benchmark comprising 900 clinical trial documents, and propose an evaluation framework grounded in regulatory guidelines and site-specific consent templates. We assess LLM performance on transforming trial protocols, often hundreds of pages, into concise, patient-facing ICFs. Additionally, we design InformGen, a retrieval-augmented, human-in-the-loop pipeline aimed at improving generation quality.</p><p><strong>Results: </strong>Baseline LLMs such as GPT-4o achieved only 70%-80% compliance and exhibited factual errors in 18%-43% of cases. In contrast, InformGen substantially improved outputs, achieving nearly 100% regulatory compliance and over 90% factual accuracy, as validated by 5 domain-expert annotators.</p><p><strong>Discussion: </strong>The study reveals critical limitations in current LLMs for clinical research document drafting, particularly in regulatory sensitivity and factual grounding. Our results highlight the need for domain-specific benchmarks and structured evaluations to support safe deployment in real-world clinical research workflows.</p><p><strong>Conclusion: </strong>LLMs offer value in clinical research document generation but must be adapted and rigorously evaluated for high-stakes applications. Our benchmark and framework provide a foundation for improving and assessing LLM-generated outputs in compliance-critical domains.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"563-572"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981641/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145379323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson
Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.
Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.
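A minimal sketch of this kind of model, assuming Python with statsmodels and illustrative column names (the paper's actual variables and software are not stated in the abstract): nightshift EHR time regressed on dayshift behaviors, with a random intercept per daytime physician.

```python
# Hedged sketch: mixed effects regression of nightshift EHR time on dayshift
# behaviors, with a daytime-physician random effect. File and column names
# are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("patient_days.csv")  # hypothetical patient-day dataset

model = smf.mixedlm(
    "night_ehr_minutes ~ day_handoff_minutes + day_total_ehr_minutes"
    " + patient_acuity",
    data=df,
    groups=df["day_physician_id"],  # random intercept per daytime physician
)
result = model.fit()
print(result.summary())
```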
Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), the nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).
Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.
Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.
{"title":"Digital interdependence: impact of work spillover during clinical team handoffs.","authors":"Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson","doi":"10.1093/jamia/ocaf212","DOIUrl":"10.1093/jamia/ocaf212","url":null,"abstract":"<p><strong>Objective: </strong>To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.</p><p><strong>Materials and methods: </strong>Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.</p><p><strong>Results: </strong>We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated. (ie, Busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).</p><p><strong>Discussion: </strong>We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.</p><p><strong>Conclusion: </strong>Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"603-610"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981644/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Murat Kantarcioglu, Will Howe, Benmei Liu, Valentina Petkov, Esmeralda Casas-Silva, Diana Velasquez-Kolnik, Bradley A Malin, Lynne Penberthy
Objective: The National Cancer Institute (NCI), part of the National Institutes of Health (NIH), supports efforts to address critical challenges in advancing cancer research. As part of this effort, NCI sponsored the development of privacy-preserving record linkage (PPRL) software that transforms identifying patient information into multiple tokens through a set of cryptographically secure keyed hash functions. This project aims to evaluate the PPRL software from the perspective of re-identification risk and to propose effective strategies to sufficiently mitigate these risks.
Materials and methods: To achieve the goals, we developed a novel re-identification risk assessment framework, based on token frequency analysis, to estimate the privacy impact of hashed tokens shared for record linkage. We assessed privacy risk through empirical analysis on a state-level voter registration database, a public dataset commonly used for re-identification, under various scenarios. These scenarios are defined based on several factors, including the size of the dataset used for linkage and a group size parameter that determines when an adversary can claim that a record has been re-identified.
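A minimal sketch of the two ingredients described, keyed-hash tokens and a frequency-based risk estimate, follows. The field combination, key handling, and exact risk definition are illustrative assumptions, not the NCI software's specification.

```python
# Hedged sketch: (1) keyed-hash tokens over identifying fields; (2) a
# frequency-analysis risk estimate counting records whose token group is
# smaller than k, so an adversary could plausibly single them out.
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"site-specific-secret"  # hypothetical shared linkage key

def token(first: str, last: str, dob: str) -> str:
    """HMAC-SHA256 token over a normalized identifier combination."""
    msg = f"{first.lower()}|{last.lower()}|{dob}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def reidentification_risk(tokens: list[str], k: int = 12) -> float:
    """Fraction of records whose token appears in a group smaller than k."""
    freq = Counter(tokens)
    at_risk = sum(1 for t in tokens if freq[t] < k)
    return at_risk / len(tokens)
```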
Results: We found that the re-identification risk based on a frequency analysis attack is approximately 0.0002 (ie, 2 patients out of 10 000 are potentially identifiable) under reasonable adversarial settings, with a group size parameter of k = 12 and a dataset size of 400 000 patients. Additionally, our analysis reveals a negative correlation between dataset size and re-identification risk.
Discussion: Re-identification risk is deemed low for the new NCI PPRL software. Token frequency analysis provides a reliable estimate of the re-identification risk in token-based PPRL tools.
{"title":"A novel analysis methodology for assessment of re-identification risks for the National Cancer Institute cancer registry privacy preserving record linkage technique.","authors":"Murat Kantarcioglu, Will Howe, Benmei Liu, Valentina Petkov, Esmeralda Casas-Silva, Diana Velasquez-Kolnik, Bradley A Malin, Lynne Penberthy","doi":"10.1093/jamia/ocaf172","DOIUrl":"10.1093/jamia/ocaf172","url":null,"abstract":"<p><strong>Objective: </strong>The National Cancer Institute (NCI), part of the National Institutes of Health (NIH) supports efforts to address critical challenges in advancing cancer research. As part of this effort, NCI sponsored the development of a privacy-preserving record linkage (PPRL) software that transforms identifying patient information into multiple tokens through a set of cryptographically secure keyed hash functions. This project aims to evaluate the PPRL software in the perspective of re-identification risks and propose effective strategies to sufficiently mitigate these risks.</p><p><strong>Materials and methods: </strong>To achieve the goals, we developed a novel re-identification risk assessment framework, based on token frequency analysis, to estimate the privacy impact of hashed tokens shared for record linkage. We assessed privacy risk through empirical analysis on a state-level voter registration database, a public dataset commonly used for re-identification, under various scenarios. These scenarios are defined based on several factors, including the size of the dataset used for linkage and a group size parameter that determines when an adversary can claim that a record has been re-identified.</p><p><strong>Results: </strong>We found that the re-identification risk based on frequency analysis attack is approximately 0.0002 (ie, 2 patients out of 10 000 are potentially identifiable) under reasonable adversarial settings, with a group size parameter of k = 12 and a dataset size of 400 000 patients. Additionally, our analysis reveals a negative correlation between dataset size and re-identification risk.</p><p><strong>Discussion: </strong>Re-identification risk is deemed low for the new NCI PPRL software. Token frequency analysis provides a reliable estimate of the re-identification risk in token-based PPRL tools.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"663-669"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981660/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145379248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lincen Yang, Siri L van der Meijden, Sesmu M Arbous, Matthijs van Leeuwen
Objective: Estimating readmission risk for intensive care unit (ICU) patients is critical for clinicians to optimize resource allocation and prevent premature discharges. Machine learning models currently applied to this task either lack interpretability or cannot identify patient subgroups with distinctive readmission risks and characteristics. We addressed this gap by introducing a cutting-edge rule-based model, namely truly unordered rule sets (TURS), to reveal heterogeneous readmission risks and subgroup-level patient characteristics.
Materials and methods: We trained TURS on all ICU admissions from January 2011 to January 2020 at Leiden University Medical Center. For each subgroup, patient characteristics and the influence of feature variables on readmission risk were analyzed.
Results: TURS identified subgroups with heterogeneous feature distributions and feature importance, providing actionable insights for ICU discharge planning. Its predictive performance (area under the receiver operating characteristic curve [ROC-AUC] 70.5%) and model complexity (5 rules, average length 2) surpassed other rule-based models.
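To make the scale of such a model concrete, the sketch below scores a patient with a hand-invented rule set of the reported shape (a handful of rules, roughly two conditions each); the rules, probabilities, and the simple averaging used to combine overlapping rules are illustrative assumptions, not the learned TURS model.

```python
# Hedged sketch: a compact probabilistic rule set applied without rule order.
# Rules and probabilities are invented for illustration only.
RULES = [
    (lambda p: p["age"] > 70 and p["ventilated"], 0.28),
    (lambda p: p["sofa"] >= 8 and p["los_days"] < 2, 0.24),
    (lambda p: p["creatinine"] > 2.0, 0.19),
]
DEFAULT_RISK = 0.06  # assumed baseline readmission probability

def readmission_risk(patient: dict) -> float:
    # Truly unordered: every rule that fires contributes; here overlapping
    # rules are combined by averaging rather than taking the first match.
    fired = [prob for cond, prob in RULES if cond(patient)]
    return sum(fired) / len(fired) if fired else DEFAULT_RISK

print(readmission_risk({"age": 75, "ventilated": True,
                        "sofa": 5, "los_days": 3, "creatinine": 1.1}))
```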
Discussion: Subgroup analysis highlighted the heterogeneity of patients. First, we compared the conditional probability distribution of each feature variable, conditioned on the fact that a patient was in a certain subgroup, with its unconditional distribution. We identified features deviating from the unconditional distribution, illustrating unique subgroup-specific implications. Furthermore, we demonstrated that features with the highest impact on the readmission risk were distinctive for each subgroup.
Conclusion: The TURS model provided a concise summary of patient subgroups, aiding ICU discharge decisions and advancing knowledge discovery in the ICU.
{"title":"Interpretable machine learning for identifying ICU readmission risk in subgroups with probabilistic rules.","authors":"Lincen Yang, Siri L van der Meijden, Sesmu M Arbous, Matthijs van Leeuwen","doi":"10.1093/jamia/ocaf171","DOIUrl":"10.1093/jamia/ocaf171","url":null,"abstract":"<p><strong>Objective: </strong>Estimating readmission risk for intensive care unit (ICU) patients is critical for clinicians to optimize resource allocation and prevent premature discharges. Machine learning models currently applied to this task either lack interpretability or cannot identify patient subgroups with distinctive readmission risks and characteristics. We addressed this gap by introducing a cutting-edge rule-based model, namely truly unordered rule sets (TURS), to reveal heterogeneous readmission risks and subgroup-level patient characteristics.</p><p><strong>Materials and methods: </strong>We trained TURS on all ICU admissions from January 2011 to January 2020 at Leiden University Medical Center. For each subgroup, patient characteristics and the influence of feature variables on readmission risk were analyzed.</p><p><strong>Results: </strong>TURS identified subgroups with heterogeneous feature distributions and feature importance, providing actionable insights for ICU discharge planning. Its predictive performance (area under the receiver operating characteristic curve [ROC-AUC] 70.5%) and model complexity (5 rules, average length 2) surpassed other rule-based models.</p><p><strong>Discussion: </strong>Subgroup analysis highlighted the heterogeneity of patients. First, we compared the conditional probability distribution of each feature variable, conditioned on the fact that a patient was in a certain subgroup, with its unconditional distribution. We identified features deviating from the unconditional distribution, illustrating unique subgroup-specific implications. Furthermore, we demonstrated that features with the highest impact on the readmission risk were distinctive for each subgroup.</p><p><strong>Conclusion: </strong>The TURS model provided a concise summary of patient subgroups, aiding ICU discharge decisions and advancing knowledge discovery in the ICU.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"690-699"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981653/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145394646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Woo Yeon Park, Teri Sippel Schmidt, Gabriel Salvador, Kevin O'Donnell, Brad Genereaux, Kyulee Jeon, Seng Chan You, Blake E Dewey, Paul Nagy
{"title":"Response to \"toward semantic interoperability of imaging and clinical data: reflections on the DICOM-OMOP integration framework\".","authors":"Woo Yeon Park, Teri Sippel Schmidt, Gabriel Salvador, Kevin O'Donnell, Brad Genereaux, Kyulee Jeon, Seng Chan You, Blake E Dewey, Paul Nagy","doi":"10.1093/jamia/ocaf216","DOIUrl":"10.1093/jamia/ocaf216","url":null,"abstract":"","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"776-778"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981681/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145844527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto
Background: Chatbots are increasingly used to deliver health education, support patient engagement, and expand access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.
Materials and methods: GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.
Results: GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.
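As a toy illustration of the scripted-chatbot mode listed in feature (2), the sketch below walks a graph of message nodes; the node schema is an assumption for illustration, not GARDE-Chat's actual authoring format.

```python
# Hedged sketch: a scripted chatbot as a graph of message nodes, in the
# spirit of a visual authoring tool. The schema is hypothetical.
SCRIPT = {
    "start": {"say": "Would you like information about genetic testing? (yes/no)",
              "next": {"yes": "explain", "no": "goodbye"}},
    "explain": {"say": "Genetic testing can identify inherited cancer risk.",
                "next": {}},
    "goodbye": {"say": "Thanks for your time!", "next": {}},
}

def run(script: dict, node: str = "start") -> None:
    while True:
        step = script[node]
        print("BOT:", step["say"])
        if not step["next"]:          # terminal node: no outgoing edges
            return
        reply = input("YOU: ").strip().lower()
        node = step["next"].get(reply, node)  # re-ask on unrecognized input

if __name__ == "__main__":
    run(SCRIPT)
```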
Discussion: Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.
Conclusion: GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.
{"title":"GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots.","authors":"Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto","doi":"10.1093/jamia/ocaf211","DOIUrl":"10.1093/jamia/ocaf211","url":null,"abstract":"<p><strong>Background: </strong>Chatbots are increasingly used to deliver health education, patient engagement, and access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.</p><p><strong>Materials and methods: </strong>GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.</p><p><strong>Results: </strong>GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.</p><p><strong>Discussion: </strong>Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.</p><p><strong>Conclusion: </strong>GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"593-602"},"PeriodicalIF":4.6,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798686/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145953525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}