Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker
Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations, and clinicians interacting with these systems can perpetuate that unfairness through overreliance. Recent work on ML suppression (silencing predictions based on an audit of the ML) shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a criterion for auditing the ML.
Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
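The suppression setup and the fairness metric can be made concrete with a short sketch. This is a minimal illustration, not the authors' code: `accept_p` stands in for the empirically derived alert-acceptance rate, the auditor is abstracted to a boolean mask, and absolute averaged odds difference (AAOD) follows its standard definition as the mean of the absolute true-positive-rate and false-positive-rate gaps between two subpopulations.

```python
import numpy as np

def collaborate(human_pred, ml_pred, auditor_trusts_ml, accept_p, rng):
    """Simulated human-AI decision: the ML prediction is surfaced only when
    the auditor trusts it, and a surfaced prediction is accepted with
    probability accept_p; suppressed predictions leave the human unaided."""
    accepted = auditor_trusts_ml & (rng.random(len(ml_pred)) < accept_p)
    return np.where(accepted, ml_pred, human_pred)

def abs_averaged_odds_difference(y_true, y_pred, group):
    """AAOD between 2 subpopulations: mean of the absolute TPR and FPR gaps
    (0 = equalized odds). y_pred is binary; group is a 2-valued array."""
    gaps = []
    for cls in (1, 0):  # cls=1 gives the TPR gap, cls=0 the FPR gap
        rates = [np.mean(y_pred[(group == g) & (y_true == cls)])
                 for g in np.unique(group)]
        gaps.append(abs(rates[0] - rates[1]))
    return float(np.mean(gaps))
```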
Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.
Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.
{"title":"Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.","authors":"Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker","doi":"10.1093/jamia/ocaf235","DOIUrl":"10.1093/jamia/ocaf235","url":null,"abstract":"<p><strong>Objective: </strong>Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML has been known to have unfairness-inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness by overreliance. Recent work exploring ML suppression-silencing predictions based on auditing the ML-shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and evaluate ML uncertainty as desiderata to audit the ML.</p><p><strong>Materials and methods: </strong>We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.</p><p><strong>Results: </strong>When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10-6) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10-4) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.</p><p><strong>Conclusion: </strong>Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.
Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.
Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.
Results: Studies often addressed multiple early-stage clinical tasks: question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), whereas LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.
Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.
{"title":"Structural insights into clinical large language models and their barriers to translational readiness.","authors":"Jiwon You, Hangsik Shin","doi":"10.1093/jamia/ocaf230","DOIUrl":"https://doi.org/10.1093/jamia/ocaf230","url":null,"abstract":"<p><strong>Background: </strong>Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.</p><p><strong>Objective: </strong>We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.</p><p><strong>Methods: </strong>We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.</p><p><strong>Results: </strong>Studies often addressed multiple early stage clinical tasks-question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%)-primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.</p><p><strong>Conclusion: </strong>Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145949378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu
Objectives: To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).
Materials and methods: We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources: UT Physicians (UTP, 1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples, 146 notes), Medical Information Mart for Intensive Care (MIMIC)-III (50 notes), and Informatics for Integrating Biology and the Bedside (i2b2, 50 notes), capturing 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). LLaMA-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.
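To make the instruction-tuning setup concrete, the sketch below shows the general shape of a training record for clinical NER. It is a hypothetical example for illustration only; the study's actual prompt templates, entity inventory, and output format are defined by the authors and may differ.

```python
# Hypothetical instruction-tuning record for clinical NER (illustrative;
# the study's real templates and label formatting may differ).
example = {
    "instruction": ("Extract all clinical problems, tests, medications, and "
                    "other treatments from the note. Return one entity per "
                    "line formatted as TYPE: text."),
    "input": "Pt denies chest pain. Started metformin 500 mg for T2DM.",
    "output": ("problem: chest pain\n"
               "medication: metformin 500 mg\n"
               "problem: T2DM"),
}
```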
Results: LLaMA models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, these performance gains came with increased computational costs: LLaMA models required more memory and graphics processing unit (GPU) hours and ran up to 28 times slower than BERT.
Discussion: While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.
Conclusion: Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.
{"title":"Information extraction from clinical notes: are we ready to switch to large language models?","authors":"Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu","doi":"10.1093/jamia/ocaf213","DOIUrl":"https://doi.org/10.1093/jamia/ocaf213","url":null,"abstract":"<p><strong>Objectives: </strong>To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).</p><p><strong>Materials and methods: </strong>We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources-UT Physicians (UTP) (1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples) (146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2) (50), capturing 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). Large Language Model Meta AI-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.</p><p><strong>Results: </strong>Large Language Model Meta AI models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, performance gains came with increased computational costs, with LLaMA models requiring more memory and Graphics Processing Unit (GPU) hours and running up to 28 times slower than BERT.</p><p><strong>Discussion: </strong>While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.</p><p><strong>Conclusion: </strong>Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145985179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto
Background: Chatbots are increasingly used to deliver health education, engage patients, and expand access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.
Materials and methods: GARDE-Chat was developed through an iterative process in which real-world use cases guided the prioritization of key features, and it was built as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.
Results: GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based, and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) a detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.
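Feature (1), the visual authoring interface, implies a declarative script representation under the hood. The sketch below shows one plausible shape for such a scripted dialog flow; GARDE-Chat's actual schema is not described in the abstract, so every key and node name here is hypothetical.

```python
# Hypothetical scripted-chatbot definition: a finite-state dialog flow a
# visual builder might emit (node names and schema are illustrative only).
script = {
    "start": {
        "message": "Hi! Are you due for a colorectal cancer screening?",
        "choices": {"yes": "schedule", "no": "end"},
    },
    "schedule": {
        "message": "We can help you book one. Reply CALL to get a phone call.",
        "choices": {"call": "end"},
    },
    "end": {"message": "Thanks! Text back anytime.", "choices": {}},
}

def next_node(current: str, reply: str) -> str:
    """Advance the dialog: follow the matching choice, else stay put."""
    return script[current]["choices"].get(reply.strip().lower(), current)
```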
Discussion: Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.
Conclusion: GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.
{"title":"GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots.","authors":"Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto","doi":"10.1093/jamia/ocaf211","DOIUrl":"10.1093/jamia/ocaf211","url":null,"abstract":"<p><strong>Background: </strong>Chatbots are increasingly used to deliver health education, patient engagement, and access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.</p><p><strong>Materials and methods: </strong>GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.</p><p><strong>Results: </strong>GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.</p><p><strong>Discussion: </strong>Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.</p><p><strong>Conclusion: </strong>GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798686/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145953525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding
Objective: Extracting social determinants of health (SDoH) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework that leverages few-shot learning with large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.
Materials and methods: SDoH-GPT is a framework that combines few-shot LLM prompting, which extracts SDoH from medical notes, with XGBoost classifiers trained on the annotations the LLM generates. This combination exploits the strength of LLMs as few-shot learners and the efficiency of XGBoost once sufficient training data are available. SDoH-GPT can therefore extract SDoH without relying on extensive medical annotations or costly human intervention.
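A minimal sketch of the two-stage idea, assuming TF-IDF features and binary labels; the data, prompts, label taxonomy, and feature set here are stand-ins, not the paper's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

# Stand-ins for the paper's data: note texts plus 0/1 labels produced by a
# few-shot LLM annotator (the study's real label set is richer).
notes = ["Patient lives alone, no transportation to appointments.",
         "Employed full time, strong family support at home.",
         "Reports housing instability and food insecurity.",
         "No social concerns noted at this visit."]
llm_labels = [1, 0, 1, 0]  # 1 = SDoH risk mentioned, per the LLM annotator

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(notes)

# Distill the LLM's annotations into a fast, cheap second-stage classifier.
clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X, llm_labels)
print(clf.predict_proba(vectorizer.transform(["Patient is homeless."]))[:, 1])
```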
Results: Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, with strong agreement with human annotators (Cohen's kappa up to 0.92). The combination of LLM and XGBoost ensures high accuracy and computational efficiency while consistently maintaining AUROC scores above 0.90.
Discussion: This study validated SDoH-GPT on three datasets, highlighting the potential of combining LLMs with XGBoost to transform medical note classification: the approach achieves highly accurate classification at significantly reduced time and cost.
Conclusion: The key contribution of this study is the integration of LLMs with XGBoost, which enables cost-effective, high-quality annotation of SDoH. This research sets the stage for SDoH extraction to become more accessible, scalable, and impactful in driving future healthcare solutions.
{"title":"SDoH-GPT: using large language models to extract social determinants of health.","authors":"Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding","doi":"10.1093/jamia/ocaf094","DOIUrl":"10.1093/jamia/ocaf094","url":null,"abstract":"<p><strong>Objective: </strong>Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.</p><p><strong>Materials and methods: </strong>SDoH-GPT is a framework including the few-shot learning LLM methods to extract the SDoH from medical notes and the XGBoost classifiers which continue to classify SDoH using the annotations generated by the few-shot learning LLM methods as training datasets. The unique combination of the few-shot learning LLM methods with XGBoost utilizes the strength of LLMs as great few shot learners and the efficiency of XGBoost when the training dataset is sufficient. Therefore, SDoH-GPT can extract SDoH without relying on extensive medical annotations or costly human intervention.</p><p><strong>Results: </strong>Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of LLM and XGBoost can ensure high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores.</p><p><strong>Discussion: </strong>This study has verified SDoH-GPT on three datasets and highlights the potential of leveraging LLM and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.</p><p><strong>Conclusion: </strong>The key contribution of this study is the integration of LLM with XGBoost, which enables cost-effective and high quality annotations of SDoH. This research sets the stage for SDoH can be more accessible, scalable, and impactful in driving future healthcare solutions.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"67-78"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758468/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144267837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury
Objective: Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.
Materials and methods: We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, the linear correlation between PVC frequency and heart rate was computed over different 24-hour periods, using intervals of varying lengths to estimate PVC frequency.
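For concreteness, the per-window computation can be sketched as follows, assuming beat-level Holter annotations. Variable names and windowing details are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import pearsonr

def pvc_hr_correlation(beat_times_s, is_pvc, window_s=3600):
    """Correlate per-window PVC frequency with per-window heart rate.

    beat_times_s: beat timestamps in seconds; is_pvc: boolean array marking
    which beats are PVCs (both derived from Holter annotations).
    """
    edges = np.arange(beat_times_s.min(), beat_times_s.max(), window_s)
    bins = np.digitize(beat_times_s, edges)
    pvc_freq, heart_rate = [], []
    for b in np.unique(bins)[:-1]:  # drop the trailing partial window
        in_win = bins == b
        pvc_freq.append(is_pvc[in_win].mean())             # PVC fraction
        heart_rate.append(in_win.sum() / (window_s / 60))  # beats per minute
    return pearsonr(pvc_freq, heart_rate)  # (r, P-value)
```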
Results: Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.
Discussion: The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.
Conclusion: Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.
{"title":"Dependence of premature ventricular complexes on heart rate-it's not that simple.","authors":"Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury","doi":"10.1093/jamia/ocaf069","DOIUrl":"10.1093/jamia/ocaf069","url":null,"abstract":"<p><strong>Objective: </strong>Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.</p><p><strong>Materials and methods: </strong>We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using different length intervals to determine PVC frequency.</p><p><strong>Results: </strong>Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.</p><p><strong>Discussion: </strong>The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.</p><p><strong>Conclusion: </strong>Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"90-97"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758478/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144055982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Objectives: To improve prediction of chronic kidney disease (CKD) progression to end-stage renal disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to integrated clinical and claims data with varying observation windows, supported by explainable artificial intelligence (AI) to enhance interpretability and reduce bias.
Materials and methods: We utilized data from 10 326 CKD patients, combining clinical and claims information from 2009 to 2018. After preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML, and DL models across 5 distinct observation windows. Feature importance and SHapley Additive exPlanations (SHAP) analyses were employed to identify key predictors. Models were tested for robustness, clinical relevance, misclassification patterns, and bias.
Results: Integrated data models outperformed single data source models, with long short-term memory achieving the highest area under the receiver operating characteristic curve (AUROC) (0.93) and F1 score (0.65). A 24-month observation window optimally balanced early detection and prediction accuracy. The 2021 estimated glomerular filtration rate (eGFR) equation improved prediction accuracy and reduced racial bias, particularly for African American patients.
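The 2021 eGFR equation referenced here is the race-free CKD-EPI creatinine equation, which is publicly specified; a direct implementation for reference (the worked example and rounding are ours):

```python
def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """2021 CKD-EPI creatinine equation (race-free), in mL/min/1.73 m^2."""
    kappa, alpha = (0.7, -0.241) if female else (0.9, -0.302)
    ratio = scr_mg_dl / kappa
    egfr = (142.0 * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    return egfr * (1.012 if female else 1.0)

# Example: a 60-year-old woman with serum creatinine 1.1 mg/dL -> ~58
print(round(egfr_ckd_epi_2021(1.1, 60, female=True)))
```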
Discussion: Improved prediction accuracy, interpretability, and bias mitigation strategies have the potential to enhance CKD management, support targeted interventions, and reduce healthcare disparities.
Conclusion: This study presents a robust framework for predicting ESRD outcomes, improving clinical decision-making through integrated multisourced data and advanced analytics. Future research will expand data integration and extend this framework to other chronic diseases.
{"title":"Enhancing end-stage renal disease outcome prediction: a multisourced data-driven approach.","authors":"Yubo Li, Rema Padman","doi":"10.1093/jamia/ocaf118","DOIUrl":"10.1093/jamia/ocaf118","url":null,"abstract":"<p><strong>Objectives: </strong>To improve prediction of chronic kidney disease (CKD) progression to end-stage renal disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to integrated clinical and claims data with varying observation windows, supported by explainable artificial intelligence (AI) to enhance interpretability and reduce bias.</p><p><strong>Materials and methods: </strong>We utilized data from 10 326 CKD patients, combining clinical and claims information from 2009 to 2018. After preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML and DL models using 5 distinct observation windows. Feature importance and SHapley Additive exPlanations (SHAP) analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification patterns, and bias.</p><p><strong>Results: </strong>Integrated data models outperformed single data source models, with long short-term memory achieving the highest area under the receiver operating characteristic curve (AUROC) (0.93) and F1 score (0.65). A 24-month observation window optimally balanced early detection and prediction accuracy. The 2021 estimated glomerular filtration rate (eGFR) equation improved prediction accuracy and reduced racial bias, particularly for African American patients.</p><p><strong>Discussion: </strong>Improved prediction accuracy, interpretability, and bias mitigation strategies have the potential to enhance CKD management, support targeted interventions, and reduce health-care disparities.</p><p><strong>Conclusion: </strong>This study presents a robust framework for predicting ESRD outcomes, improving clinical decision-making through integrated multisourced data and advanced analytics. Future research will expand data integration and extend this framework to other chronic diseases.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"26-36"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758457/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144838430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shane J Sacco, Kun Chen, Fei Wang, Steven C Rogers, Robert H Aseltine
Objective: Emerging efforts to identify patients at risk of suicide have focused on the development of predictive algorithms for use in healthcare settings. We address a major challenge in effective risk modeling: healthcare settings that lack sufficient data to create and apply risk models. This study aimed to improve risk prediction using transfer learning, or data fusion, by incorporating risk information from external data sources to augment the data available in a particular clinical setting.
Materials and methods: In this retrospective study, we developed predictive models in individual Connecticut hospitals using medical claims data. We compared conventional models containing demographics and historical medical diagnosis codes with fusion models containing conventional features and fused risk information that described similarities in historical diagnosis codes between patients from the hospital and patients receiving care for suicide attempts at other hospitals.
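One plausible construction of the fused risk feature, sketched under the assumption that "similarity" means cosine similarity over binary diagnosis-code vectors; the paper's exact similarity measure and aggregation may differ, and all data here are synthetic stand-ins.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# X_local: binary patient-by-diagnosis-code matrix for one hospital;
# X_ext_cases: the same code space for suicide-attempt patients treated at
# other hospitals. Both are random stand-ins for claims-derived features.
rng = np.random.default_rng(0)
X_local = rng.integers(0, 2, size=(100, 40))
X_ext_cases = rng.integers(0, 2, size=(500, 40))

# Fused feature: each local patient's mean similarity to external cases,
# appended to the conventional feature matrix before model fitting.
similarity_to_cases = cosine_similarity(X_local, X_ext_cases).mean(axis=1)
X_fusion = np.hstack([X_local, similarity_to_cases[:, None]])
```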
Results: Our sample contained 27 hospitals and 636 758 patients aged 18-64 years. Fusion improved prediction for 93% of hospitals while slightly worsening prediction for 7%. Median areas under the receiver operating characteristic (ROC) and precision-recall curves for conventional models were 77.6% and 3.4%, respectively. Fusion improved these metrics by medians of 3.3 and 0.3 percentage points, respectively (Ps < .001). Median sensitivities and positive predictive values at 90% and 95% specificity also improved (Ps < .001).
Discussion: This study provided strong evidence that data fusion improved model performance across hospitals. Improvement was of greatest magnitude in facilities treating relatively few suicidal patients.
Conclusion: Data fusion holds promise as a methodology to improve suicide risk prediction in healthcare settings with limited or incomplete data.
{"title":"Using transfer learning to improve prediction of suicide risk in acute care hospitals.","authors":"Shane J Sacco, Kun Chen, Fei Wang, Steven C Rogers, Robert H Aseltine","doi":"10.1093/jamia/ocaf126","DOIUrl":"10.1093/jamia/ocaf126","url":null,"abstract":"<p><strong>Objective: </strong>Emerging efforts to identify patients at risk of suicide have focused on the development of predictive algorithms for use in healthcare settings. We address a major challenge in effective risk modeling in healthcare settings with insufficient data with which to create and apply risk models. This study aimed to improve risk prediction using transfer learning or data fusion by incorporating risk information from external data sources to augment the data available in particular clinical settings.</p><p><strong>Materials and methods: </strong>In this retrospective study, we developed predictive models in individual Connecticut hospitals using medical claims data. We compared conventional models containing demographics and historical medical diagnosis codes with fusion models containing conventional features and fused risk information that described similarities in historical diagnosis codes between patients from the hospital and patients receiving care for suicide attempts at other hospitals.</p><p><strong>Results: </strong>Our sample contained 27 hospitals and 636 758 18- to 64-year-old patients. Fusion improved prediction for 93% of hospitals, while slightly worsening prediction for 7%. Median areas under the ROC and precision-recall curves of conventional models were 77.6% and 3.4%, respectively. Fusion improved these metrics by a median of 3.3 and 0.3 points, respectively (Ps < .001). Median sensitivities and positive predictive values at 90% and 95% specificity were also improved (Ps < .001).</p><p><strong>Discussion: </strong>This study provided strong evidence that data fusion improved model performance across hospitals. Improvement was of greatest magnitude in facilities treating relatively few suicidal patients.</p><p><strong>Conclusion: </strong>Data fusion holds promise as a methodology to improve suicide risk prediction in healthcare settings with limited or incomplete data.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"159-166"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758463/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144715164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dilruk Perera, Siqi Liu, Kay Choong See, Mengling Feng
Objectives: This study introduces Smart Imitator (SI), a 2-phase reinforcement learning (RL) solution enhancing personalized treatment policies in healthcare, addressing challenges from imperfect clinician data and complex environments.
Materials and methods: Smart Imitator's first phase uses adversarial cooperative imitation learning with a novel sample selection schema to categorize clinician policies from optimal to nonoptimal. The second phase creates a parameterized reward function to guide the learning of superior treatment policies through RL. Smart Imitator's effectiveness was validated on 2 datasets: a sepsis dataset with 19 711 patient trajectories and a diabetes dataset with 7234 trajectories.
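The parameterized reward in phase 2 can be illustrated schematically. The function below is a hypothetical stand-in: it blends the clinical outcome signal with agreement with clinician actions, weighted by the phase-1 judgment of how likely that clinician decision was optimal. Smart Imitator's actual parameterization is more elaborate.

```python
def shaped_reward(outcome_reward: float, action: int, clinician_action: int,
                  optimal_prob: float, w: float = 0.5) -> float:
    """Hypothetical phase-2 reward: mix the outcome signal with an imitation
    bonus that is paid only for matching a clinician action the phase-1
    model judged likely optimal (optimal_prob in [0, 1])."""
    imitation_bonus = optimal_prob if action == clinician_action else 0.0
    return (1.0 - w) * outcome_reward + w * imitation_bonus
```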
Results: Extensive quantitative and qualitative experiments showed that SI significantly outperformed state-of-the-art baselines in both datasets. For sepsis, SI reduced estimated mortality rates by 19.6% compared to the best baseline. For diabetes, SI reduced HbA1c-High rates by 12.2%. The learned policies aligned closely with successful clinical decisions and deviated strategically when necessary. These deviations aligned with recent clinical findings, suggesting improved outcomes.
Discussion: Smart Imitator advances RL applications by addressing challenges such as imperfect data and environmental complexities, demonstrating effectiveness within the tested conditions of sepsis and diabetes. Further validation across diverse conditions and exploration of additional RL algorithms are needed to enhance precision and generalizability.
Conclusion: This study shows potential in advancing personalized healthcare learning from clinician behaviors to improve treatment outcomes. Its methodology offers a robust approach for adaptive, personalized strategies in various complex and uncertain environments.
{"title":"Smart Imitator: Learning from Imperfect Clinical Decisions.","authors":"Dilruk Perera, Siqi Liu, Kay Choong See, Mengling Feng","doi":"10.1093/jamia/ocae320","DOIUrl":"10.1093/jamia/ocae320","url":null,"abstract":"<p><strong>Objectives: </strong>This study introduces Smart Imitator (SI), a 2-phase reinforcement learning (RL) solution enhancing personalized treatment policies in healthcare, addressing challenges from imperfect clinician data and complex environments.</p><p><strong>Materials and methods: </strong>Smart Imitator's first phase uses adversarial cooperative imitation learning with a novel sample selection schema to categorize clinician policies from optimal to nonoptimal. The second phase creates a parameterized reward function to guide the learning of superior treatment policies through RL. Smart Imitator's effectiveness was validated on 2 datasets: a sepsis dataset with 19 711 patient trajectories and a diabetes dataset with 7234 trajectories.</p><p><strong>Results: </strong>Extensive quantitative and qualitative experiments showed that SI significantly outperformed state-of-the-art baselines in both datasets. For sepsis, SI reduced estimated mortality rates by 19.6% compared to the best baseline. For diabetes, SI reduced HbA1c-High rates by 12.2%. The learned policies aligned closely with successful clinical decisions and deviated strategically when necessary. These deviations aligned with recent clinical findings, suggesting improved outcomes.</p><p><strong>Discussion: </strong>Smart Imitator advances RL applications by addressing challenges such as imperfect data and environmental complexities, demonstrating effectiveness within the tested conditions of sepsis and diabetes. Further validation across diverse conditions and exploration of additional RL algorithms are needed to enhance precision and generalizability.</p><p><strong>Conclusion: </strong>This study shows potential in advancing personalized healthcare learning from clinician behaviors to improve treatment outcomes. Its methodology offers a robust approach for adaptive, personalized strategies in various complex and uncertain environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"49-66"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758472/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142962554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meng-Han Tsai, Sung-Chu Ko, Amy Huaishiuan Huang, Lorenzo Porta, Cecilia Ferretti, Clarissa Longhi, Wan-Ting Hsu, Yung-Han Chang, Jo-Ching Hsiung, Chin-Hua Su, Filippo Galbiati, Chien-Chang Lee
Objectives: To pioneer the first artificial intelligence system integrating radiological and objective clinical data, simulating the clinical reasoning process, for the early prediction of high-risk influenza patients.
Materials and methods: Our system was developed using a cohort from National Taiwan University Hospital in Taiwan, with external validation data from ASST Grande Ospedale Metropolitano Niguarda in Italy. Convolutional neural networks pretrained on ImageNet were trained by regression against a 5-point severity scale to develop the influenza chest X-ray (CXR) severity scoring model, FluDeep-XR. Early, late, and joint fusion architectures, incorporating varying weights of CXR severity with clinical data, were designed to predict 30-day mortality and compared with models using only CXR or clinical data. The best-performing model was designated FluDeep. The explainability of FluDeep-XR and FluDeep was illustrated through activation maps and SHapley Additive exPlanations (SHAP).
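Late fusion, the design that ultimately performed best here, reduces to concatenating the image model's severity output with tabular clinical features and fitting a second-stage classifier. A minimal sketch with synthetic stand-in data (all names, shapes, and hyperparameters are ours, not the study's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
cxr_severity = rng.uniform(0, 5, size=n)   # stand-in for FluDeep-XR scores
X_clinical = rng.normal(size=(n, 5))       # eg age, CRP, hematocrit, HR, RR
died_30d = rng.integers(0, 2, size=n)      # stand-in 30-day mortality labels

# Late fusion: concatenate the image model's severity output with the
# clinical variables, then fit a second-stage Random Forest on the result.
X_fused = np.column_stack([cxr_severity, X_clinical])
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_fused, died_30d)
risk = clf.predict_proba(X_fused)[:, 1]
```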
Results: The Xception-based model, FluDeep-XR, achieved a mean square error of 0.738 on the external validation dataset. The Random Forest-based late fusion model, FluDeep, outperformed all the other models, achieving an area under the receiver operating characteristic curve of 0.818 and a sensitivity of 0.706 on the external dataset. Activation maps highlighted clear lung fields. SHAP identified age, C-reactive protein, hematocrit, heart rate, and respiratory rate as the top 5 clinical features.
Discussion: The integration of medical imaging with objective clinical data outperformed single-modality models in predicting 30-day mortality in influenza patients. We ensured that our models' explanations aligned with clinical knowledge and validated their applicability at an external institution abroad.
Conclusion: FluDeep highlights the potential of combining radiological and clinical information in a late fusion design, enhancing diagnostic accuracy and offering an explainable and generalizable decision support system.
{"title":"Predicting mortality in hospitalized influenza patients: integration of deep learning-based chest X-ray severity score (FluDeep-XR) and clinical variables.","authors":"Meng-Han Tsai, Sung-Chu Ko, Amy Huaishiuan Huang, Lorenzo Porta, Cecilia Ferretti, Clarissa Longhi, Wan-Ting Hsu, Yung-Han Chang, Jo-Ching Hsiung, Chin-Hua Su, Filippo Galbiati, Chien-Chang Lee","doi":"10.1093/jamia/ocae286","DOIUrl":"10.1093/jamia/ocae286","url":null,"abstract":"<p><strong>Objectives: </strong>To pioneer the first artificial intelligence system integrating radiological and objective clinical data, simulating the clinical reasoning process, for the early prediction of high-risk influenza patients.</p><p><strong>Materials and methods: </strong>Our system was developed using a cohort from National Taiwan University Hospital in Taiwan, with external validation data from ASST Grande Ospedale Metropolitano Niguarda in Italy. Convolutional neural networks pretrained on ImageNet were regressively trained using a 5-point scale to develop the influenza chest X-ray (CXR) severity scoring model, FluDeep-XR. Early, late, and joint fusion structures, incorporating varying weights of CXR severity with clinical data, were designed to predict 30-day mortality and compared with models using only CXR or clinical data. The best-performing model was designated as FluDeep. The explainability of FluDeep-XR and FluDeep was illustrated through activation maps and SHapley Additive exPlanations (SHAP).</p><p><strong>Results: </strong>The Xception-based model, FluDeep-XR, achieved a mean square error of 0.738 in the external validation dataset. The Random Forest-based late fusion model, FluDeep, outperformed all the other models, achieving an area under the receiver operating curve of 0.818 and a sensitivity of 0.706 in the external dataset. Activation maps highlighted clear lung fields. Shapley additive explanations identified age, C-reactive protein, hematocrit, heart rate, and respiratory rate as the top 5 important clinical features.</p><p><strong>Discussion: </strong>The integration of medical imaging with objective clinical data outperformed single-modality models to predict 30-day mortality in influenza patients. We ensured the explainability of our models aligned with clinical knowledge and validated its applicability across foreign institutions.</p><p><strong>Conclusion: </strong>FluDeep highlights the potential of combining radiological and clinical information in late fusion design, enhancing diagnostic accuracy and offering an explainable, and generalizable decision support system.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"133-143"},"PeriodicalIF":4.6,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758471/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142689371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}