
Journal of the American Medical Informatics Association: Latest Articles

Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-13 DOI: 10.1093/jamia/ocaf235
Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker

Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work exploring ML suppression (silencing predictions based on auditing the ML) shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a desideratum for auditing the ML.

Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
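
The fairness metric named above can be illustrated with a small sketch. The abstract does not give the exact formula, so this assumes the common definition of absolute averaged odds difference: the mean of the absolute gaps in false-positive rate and true-positive rate between two groups. The function names and the toy data are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def abs_avg_odds_difference(y_true, y_pred, group):
    """Assumed definition: 0.5 * (|FPR_a - FPR_b| + |TPR_a - TPR_b|).

    `group` holds 0/1 flags marking the two subpopulations being compared.
    """
    a = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 0]
    b = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == 1]
    tpr_a, fpr_a = rates([t for t, _ in a], [p for _, p in a])
    tpr_b, fpr_b = rates([t for t, _ in b], [p for _, p in b])
    return 0.5 * (abs(fpr_a - fpr_b) + abs(tpr_a - tpr_b))


# Toy data (hypothetical): four cases per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(abs_avg_odds_difference(y_true, y_pred, group))
```

A value of 0 means the two groups have identical error rates; larger values indicate a larger fairness gap.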

Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.
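
To make the idea of suppression concrete, here is a deliberately simplified sketch. It is not the study's auditor model: the auditor is reduced to a single uncertainty threshold, and the function name, parameters, and threshold value are all assumptions made for illustration.

```python
def team_decision(human, ml, ml_uncertainty, accepts_ml, threshold=0.2):
    """Return the final binary decision for one case.

    human          -- the clinician's own decision (0 or 1)
    ml             -- the ML prediction (0 or 1)
    ml_uncertainty -- an uncertainty score for the ML prediction (assumed in [0, 1])
    accepts_ml     -- whether the clinician would accept a shown ML prediction
    threshold      -- hypothetical cutoff above which the auditor silences the ML
    """
    suppressed = ml_uncertainty > threshold
    # A suppressed prediction is never shown, so the clinician decides alone.
    if suppressed or not accepts_ml:
        return human
    return ml


print(team_decision(human=0, ml=1, ml_uncertainty=0.1, accepts_ml=True))
print(team_decision(human=0, ml=1, ml_uncertainty=0.5, accepts_ml=True))
```

In the first call the prediction is shown and accepted; in the second it is suppressed, so the result falls back to the human decision.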

Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.

Citations: 0
Structural insights into clinical large language models and their barriers to translational readiness.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-11 DOI: 10.1093/jamia/ocaf230
Jiwon You, Hangsik Shin

Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.

Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.

Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.

Results: Studies often addressed multiple early-stage clinical tasks (question answering, 56.2%; knowledge structuring, 31.5%; disease prediction, 43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.

Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.

Citations: 0
Information extraction from clinical notes: are we ready to switch to large language models?
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-10 DOI: 10.1093/jamia/ocaf213
Yan Hu, Xu Zuo, Yujia Zhou, Xueqing Peng, Jimin Huang, Vipina K Keloth, Vincent J Zhang, Ruey-Ling Weng, Cathy Shyr, Qingyu Chen, Xiaoqian Jiang, Kirk E Roberts, Hua Xu

Objectives: To assess the performance, generalizability, and computational efficiency of instruction-tuned Large Language Model Meta AI (LLaMA)-2 and LLaMA-3 models compared to bidirectional encoder representations from transformers (BERT) for clinical information extraction (IE) tasks, specifically named entity recognition (NER) and relation extraction (RE).

Materials and methods: We developed a comprehensive annotated corpus of 1588 clinical notes from 4 data sources: UT Physicians (UTP) (1342 notes), Transcribed Medical Transcription Sample Reports and Examples (MTSamples) (146), Medical Information Mart for Intensive Care (MIMIC)-III (50), and Informatics for Integrating Biology and the Bedside (i2b2) (50). The corpus captures 4 clinical entities (problems, tests, medications, other treatments) and 16 modifiers (eg, negation, certainty). LLaMA-2 and LLaMA-3 were instruction-tuned for clinical NER and RE, and their performance was benchmarked against BERT.

Results: LLaMA models consistently outperformed BERT across datasets. In data-rich settings (eg, UTP), LLaMA achieved marginal gains (approximately 1% improvement for NER and 1.5%-3.7% for RE). Under limited data conditions (eg, MTSamples, MIMIC-III) and on the unseen i2b2 dataset, LLaMA-3-70B improved F1 scores by over 7% for NER and 4% for RE. However, performance gains came with increased computational costs, with LLaMA models requiring more memory and Graphics Processing Unit (GPU) hours and running up to 28 times slower than BERT.
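
For readers unfamiliar with how NER is scored, the F1 figures above can be illustrated with a minimal sketch. The abstract does not state the matching criterion, so this assumes exact matching on (start, end, entity type); the function name and the example spans are hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 under exact match (an assumed criterion).

    Each entity is a (start, end, entity_type) tuple; a prediction counts as
    correct only if all three fields match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one note.
gold = [(0, 2, "problem"), (5, 6, "test"), (9, 10, "medication")]
predicted = [(0, 2, "problem"), (5, 6, "medication"), (9, 10, "medication")]
print(round(entity_f1(gold, predicted), 3))
```

Here the second span has the right boundaries but the wrong type, so it counts against both precision and recall.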

Discussion: While LLaMA models offer enhanced performance, their higher computational demands and slower throughput highlight the need to balance performance with practical resource constraints. Application-specific considerations are essential when choosing between LLMs and BERT for clinical IE.

Conclusion: Instruction-tuned LLaMA models show promise for clinical NER and RE tasks. However, the tradeoff between improved performance and increased computational cost must be carefully evaluated. We release our Kiwi package (https://kiwi.clinicalnlp.org/) to facilitate the application of both LLaMA and BERT models in clinical IE applications.

Citations: 0
GARDE-Chat: a scalable, open-source platform for building and deploying health chatbots.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-10 DOI: 10.1093/jamia/ocaf211
Guilherme Del Fiol, Emerson Borsato, Richard L Bradshaw, Jiantao Bian, Alana Woodbury, Courtney Gauchel, Karen L Eilbeck, Whitney Maxwell, Kelsey Ellis, Anne C Madeo, Chelsey Schlechter, Polina V Kukhareva, Caitlin G Allen, Michael Kean, Elena B Elkin, Ravi Sharaf, Muhammad D Ahsan, Melissa Frey, Lauren Davis-Rivera, Wendy K Kohlmann, David W Wetter, Kimberly A Kaphingst, Kensaku Kawamoto

Background: Chatbots are increasingly used to deliver health education, patient engagement, and access to healthcare services. GARDE-Chat is an open-source platform designed to facilitate the development, deployment, and dissemination of chatbot-based digital health interventions across different domains and settings.

Materials and methods: GARDE-Chat was developed through an iterative process informed by real-world use cases to guide prioritization of key features. The tool was developed as an open-source platform to promote collaboration, broad dissemination, and impact across research and clinical domains.

Results: GARDE-Chat's main features include (1) a visual authoring interface that allows non-programmers to design chatbots; (2) support for scripted, large language model (LLM)-based and hybrid chatbots; (3) capacity to share chatbots with researchers and institutions; (4) integration with external applications and data sources such as electronic health records and REDCap; (5) delivery via web browsers or text messaging; and (6) detailed audit log supporting analyses of chatbot user interactions. Since its first release in July 2022, GARDE-Chat has supported the development of chatbot-based interventions tested in multiple studies, including large pragmatic clinical trials addressing topics such as genetic testing, COVID-19 testing, tobacco cessation, and cancer screening.

Discussion: Ongoing challenges include the effort required for developing chatbot scripts, ensuring safe use of LLMs, and integrating with clinical systems.

Conclusion: GARDE-Chat is a generalizable platform for creating, implementing, and disseminating scalable chatbot-based population health interventions. It has been validated in several studies, and it is available to researchers and healthcare systems through an open-source mechanism.

Citations: 0
SDoH-GPT: using large language models to extract social determinants of health.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-01 DOI: 10.1093/jamia/ocaf094
Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

Objective: Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.

Materials and methods: SDoH-GPT is a framework that combines few-shot learning LLM methods, which extract SDoH from medical notes, with XGBoost classifiers, which then classify SDoH using the LLM-generated annotations as training data. This combination draws on the strength of LLMs as few-shot learners and on the efficiency of XGBoost once the training dataset is sufficiently large. As a result, SDoH-GPT can extract SDoH without relying on extensive medical annotations or costly human intervention.

Results: Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of LLM and XGBoost can ensure high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores.
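
Cohen's kappa, the agreement statistic reported above, corrects raw agreement for the agreement expected by chance. The sketch below uses the standard two-rater formula; the function name and the label lists are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected if both raters labelled independently.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: human annotator vs. model output on ten notes.
human = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
model = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohens_kappa(human, model), 2))
```

In this toy case the raters agree on 9 of 10 items, but half of that agreement is expected by chance, so kappa is lower than the raw 0.9 agreement.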

Discussion: This study has verified SDoH-GPT on three datasets and highlights the potential of leveraging LLM and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.

Conclusion: The key contribution of this study is the integration of LLM with XGBoost, which enables cost-effective and high-quality annotations of SDoH. This research sets the stage for SDoH to be more accessible, scalable, and impactful in driving future healthcare solutions.

Citations: 0
Dependence of premature ventricular complexes on heart rate: it's not that simple.
IF 4.6 Zone 2 (Medicine) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-01 DOI: 10.1093/jamia/ocaf069
Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury

Objective: Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.

Materials and methods: We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using different length intervals to determine PVC frequency.
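
The classification step described above can be sketched as follows. This is an illustration only: the cutoff separating "positive", "negative", and "neutral" is a hypothetical value, the function names are invented, and the per-interval values are made up rather than drawn from the recordings.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series; treat as none
    return sxy / math.sqrt(sxx * syy)


def classify_dependence(heart_rate, pvc_frequency, cutoff=0.3):
    """Label the PVC/heart-rate relationship from per-interval averages.

    `cutoff` is a hypothetical threshold, not a value from the study.
    """
    r = pearson_r(heart_rate, pvc_frequency)
    if r > cutoff:
        return "positive"
    if r < -cutoff:
        return "negative"
    return "neutral"


# Made-up per-interval values: PVC frequency peaks at mid-range heart rates.
heart_rate = [60, 70, 80, 90, 100]
pvc_frequency = [1, 3, 4, 3, 1]
print(classify_dependence(heart_rate, pvc_frequency))
```

The example is labelled "neutral" even though PVC frequency clearly depends on heart rate, which is the kind of nonlinear relationship a linear correlation cannot capture.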

Results: Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.

Discussion: The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.

Conclusion: Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.

Pages: 90-97 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758478/pdf/
Citations: 0
Enhancing end-stage renal disease outcome prediction: a multisourced data-driven approach.
IF 4.6 | Zone 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | DOI: 10.1093/jamia/ocaf118
Yubo Li, Rema Padman

Objectives: To improve prediction of chronic kidney disease (CKD) progression to end-stage renal disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to integrated clinical and claims data with varying observation windows, supported by explainable artificial intelligence (AI) to enhance interpretability and reduce bias.

Materials and methods: We utilized data from 10 326 CKD patients, combining clinical and claims information from 2009 to 2018. After preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML and DL models using 5 distinct observation windows. Feature importance and SHapley Additive exPlanations (SHAP) analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification patterns, and bias.

Results: Integrated data models outperformed single data source models, with long short-term memory achieving the highest area under the receiver operating characteristic curve (AUROC) (0.93) and F1 score (0.65). A 24-month observation window optimally balanced early detection and prediction accuracy. The 2021 estimated glomerular filtration rate (eGFR) equation improved prediction accuracy and reduced racial bias, particularly for African American patients.
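
For readers unfamiliar with the 2021 eGFR equation, the sketch below assumes the abstract refers to the race-free CKD-EPI 2021 creatinine equation; the abstract itself does not name it. The function name and argument names are illustrative only.

```python
def egfr_ckd_epi_2021(serum_creatinine_mg_dl, age_years, is_female):
    """Estimated GFR in mL/min/1.73 m^2 from the CKD-EPI 2021 creatinine equation.

    Assumption: this is the "2021 eGFR equation" meant in the abstract.
    The equation has no race coefficient.
    """
    if serum_creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    kappa = 0.7 if is_female else 0.9        # sex-specific creatinine scale
    alpha = -0.241 if is_female else -0.302  # exponent below the scale point
    ratio = serum_creatinine_mg_dl / kappa
    egfr = (
        142.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.200
        * 0.9938 ** age_years
    )
    if is_female:
        egfr *= 1.012
    return egfr


if __name__ == "__main__":
    print(round(egfr_ckd_epi_2021(1.0, 50, is_female=False), 1))
```

Because the equation takes only creatinine, age, and sex, two patients with the same inputs receive the same estimate regardless of race, which is the property relevant to the bias finding reported above.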

Discussion: Improved prediction accuracy, interpretability, and bias mitigation strategies have the potential to enhance CKD management, support targeted interventions, and reduce health-care disparities.

Conclusion: This study presents a robust framework for predicting ESRD outcomes, improving clinical decision-making through integrated multisourced data and advanced analytics. Future research will expand data integration and extend this framework to other chronic diseases.

Pages: 26-36 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758457/pdf/
Citations: 0
Using transfer learning to improve prediction of suicide risk in acute care hospitals.
IF 4.6 | Zone 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | DOI: 10.1093/jamia/ocaf126
Shane J Sacco, Kun Chen, Fei Wang, Steven C Rogers, Robert H Aseltine

Objective: Emerging efforts to identify patients at risk of suicide have focused on the development of predictive algorithms for use in healthcare settings. We address a major challenge to effective risk modeling: healthcare settings that have insufficient data to create and apply risk models. This study aimed to improve risk prediction using transfer learning or data fusion by incorporating risk information from external data sources to augment the data available in particular clinical settings.

Materials and methods: In this retrospective study, we developed predictive models in individual Connecticut hospitals using medical claims data. We compared conventional models containing demographics and historical medical diagnosis codes with fusion models containing conventional features and fused risk information that described similarities in historical diagnosis codes between patients from the hospital and patients receiving care for suicide attempts at other hospitals.

Results: Our sample contained 27 hospitals and 636 758 patients aged 18 to 64 years. Fusion improved prediction for 93% of hospitals, while slightly worsening prediction for 7%. Median areas under the ROC and precision-recall curves of conventional models were 77.6% and 3.4%, respectively. Fusion improved these metrics by a median of 3.3 and 0.3 points, respectively (Ps < .001). Median sensitivities and positive predictive values at 90% and 95% specificity were also improved (Ps < .001).
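
Sensitivity and positive predictive value "at 90% specificity" are read off at the score threshold that leaves at least 90% of non-cases unflagged. The following is a minimal sketch of that calculation under assumed conventions (a patient is flagged when the score is strictly above the threshold, and labels are 0 or 1); it is not the study's evaluation code.

```python
def metrics_at_specificity(scores, labels, target_specificity=0.90):
    """Return (threshold, specificity, sensitivity, ppv).

    Picks the lowest observed score threshold whose specificity reaches
    the target. Flagging rule (assumed): score > threshold.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative case")

    for threshold in sorted(set(scores)):
        true_neg = sum(1 for s in negatives if s <= threshold)
        specificity = true_neg / len(negatives)
        if specificity >= target_specificity:
            break  # always reached by the maximum score (specificity 1.0)

    true_pos = sum(1 for s in positives if s > threshold)
    false_pos = len(negatives) - true_neg
    sensitivity = true_pos / len(positives)
    flagged = true_pos + false_pos
    ppv = true_pos / flagged if flagged else float("nan")
    return threshold, specificity, sensitivity, ppv


if __name__ == "__main__":
    neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
    pos = [0.48, 0.60]
    print(metrics_at_specificity(neg + pos, [0] * len(neg) + [1] * len(pos)))
```

With a rare outcome, PPV stays low even at high specificity because false positives from the large non-case group outnumber true positives, which is consistent with the small precision-recall areas reported above.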

Discussion: This study provided strong evidence that data fusion improved model performance across hospitals. Improvement was of greatest magnitude in facilities treating relatively few suicidal patients.

Conclusion: Data fusion holds promise as a methodology to improve suicide risk prediction in healthcare settings with limited or incomplete data.

Pages: 159-166 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758463/pdf/
Citations: 0
Smart Imitator: Learning from Imperfect Clinical Decisions.
IF 4.6 | Zone 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | DOI: 10.1093/jamia/ocae320
Dilruk Perera, Siqi Liu, Kay Choong See, Mengling Feng

Objectives: This study introduces Smart Imitator (SI), a 2-phase reinforcement learning (RL) solution enhancing personalized treatment policies in healthcare, addressing challenges from imperfect clinician data and complex environments.

Materials and methods: Smart Imitator's first phase uses adversarial cooperative imitation learning with a novel sample selection schema to categorize clinician policies from optimal to nonoptimal. The second phase creates a parameterized reward function to guide the learning of superior treatment policies through RL. Smart Imitator's effectiveness was validated on 2 datasets: a sepsis dataset with 19 711 patient trajectories and a diabetes dataset with 7234 trajectories.

Results: Extensive quantitative and qualitative experiments showed that SI significantly outperformed state-of-the-art baselines in both datasets. For sepsis, SI reduced estimated mortality rates by 19.6% compared to the best baseline. For diabetes, SI reduced HbA1c-High rates by 12.2%. The learned policies aligned closely with successful clinical decisions and deviated strategically when necessary. These deviations aligned with recent clinical findings, suggesting improved outcomes.

Discussion: Smart Imitator advances RL applications by addressing challenges such as imperfect data and environmental complexities, demonstrating effectiveness within the tested conditions of sepsis and diabetes. Further validation across diverse conditions and exploration of additional RL algorithms are needed to enhance precision and generalizability.

Conclusion: This study shows potential for advancing personalized healthcare by learning from clinician behaviors to improve treatment outcomes. Its methodology offers a robust approach for adaptive, personalized strategies in various complex and uncertain environments.

Pages: 49-66 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758472/pdf/
Citations: 0
Predicting mortality in hospitalized influenza patients: integration of deep learning-based chest X-ray severity score (FluDeep-XR) and clinical variables.
IF 4.6 | Zone 2 (Medicine) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-01 | DOI: 10.1093/jamia/ocae286
Meng-Han Tsai, Sung-Chu Ko, Amy Huaishiuan Huang, Lorenzo Porta, Cecilia Ferretti, Clarissa Longhi, Wan-Ting Hsu, Yung-Han Chang, Jo-Ching Hsiung, Chin-Hua Su, Filippo Galbiati, Chien-Chang Lee

Objectives: To pioneer the first artificial intelligence system integrating radiological and objective clinical data, simulating the clinical reasoning process, for the early prediction of high-risk influenza patients.

Materials and methods: Our system was developed using a cohort from National Taiwan University Hospital in Taiwan, with external validation data from ASST Grande Ospedale Metropolitano Niguarda in Italy. Convolutional neural networks pretrained on ImageNet were regressively trained using a 5-point scale to develop the influenza chest X-ray (CXR) severity scoring model, FluDeep-XR. Early, late, and joint fusion structures, incorporating varying weights of CXR severity with clinical data, were designed to predict 30-day mortality and compared with models using only CXR or clinical data. The best-performing model was designated as FluDeep. The explainability of FluDeep-XR and FluDeep was illustrated through activation maps and SHapley Additive exPlanations (SHAP).
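
As a rough illustration of the fusion idea, the sketch below shows one common form of late fusion: each modality produces its own mortality probability and the two are combined with a weight. This is an assumption made for illustration; the abstract does not describe how FluDeep combines its inputs, and the function and parameter names are hypothetical.

```python
def late_fusion(p_image, p_clinical, image_weight=0.5):
    """Weighted average of two per-modality risk probabilities.

    p_image:      probability from an imaging-only model (assumed input)
    p_clinical:   probability from a clinical-data-only model (assumed input)
    image_weight: share given to the imaging model, between 0 and 1
    """
    for name, value in (("p_image", p_image), ("p_clinical", p_clinical),
                        ("image_weight", image_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return image_weight * p_image + (1.0 - image_weight) * p_clinical


if __name__ == "__main__":
    # Sweep the weight to see how much each modality drives the fused risk.
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(w, late_fusion(0.8, 0.4, image_weight=w))
```

Early fusion, by contrast, would concatenate the imaging-derived score with the clinical variables and train a single model on the combined features.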

Results: The Xception-based model, FluDeep-XR, achieved a mean square error of 0.738 in the external validation dataset. The Random Forest-based late fusion model, FluDeep, outperformed all the other models, achieving an area under the receiver operating characteristic curve of 0.818 and a sensitivity of 0.706 in the external dataset. Activation maps highlighted clear lung fields. Shapley additive explanations identified age, C-reactive protein, hematocrit, heart rate, and respiratory rate as the top 5 important clinical features.

Discussion: The integration of medical imaging with objective clinical data outperformed single-modality models in predicting 30-day mortality in influenza patients. We ensured that the explainability of our models aligned with clinical knowledge and validated their applicability across foreign institutions.

Conclusion: FluDeep highlights the potential of combining radiological and clinical information in late fusion design, enhancing diagnostic accuracy and offering an explainable and generalizable decision support system.

Pages: 133-143 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12758471/pdf/
Citations: 0