Nicholas J Jackson, Katherine E Brown, Rachael Miller, Matthew Murrow, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Natalie C Benda, Jessica S Ancker
Objectives: Research on artificial intelligence (AI)-based clinical decision-support (AI-CDS) systems has returned mixed results. Sometimes providing AI-CDS to a clinician will improve decision-making performance, sometimes it will not, and it is not always clear why. This scoping review seeks to clarify existing evidence by identifying clinician-level and technology design factors that impact the effectiveness of AI-assisted decision-making in medicine.
Materials and methods: We searched MEDLINE, Web of Science, and Embase for peer-reviewed papers that studied factors impacting the effectiveness of AI-CDS. We identified the factors studied and their impact on 3 outcomes: clinicians' attitudes toward AI, their decisions (eg, acceptance rate of AI recommendations), and their performance when utilizing AI-CDS.
Results: We retrieved 5850 articles and included 45. Four clinician-level and technology design factors were commonly studied. Expert clinicians may benefit less from AI-CDS than nonexperts, with some mixed results. Explainable AI increased clinicians' trust, but could also increase trust in incorrect AI recommendations, potentially harming human-AI collaborative performance. Clinicians' baseline attitudes toward AI predict their acceptance rates of AI recommendations. Of the 3 outcomes of interest, human-AI collaborative performance was most commonly assessed.
Discussion and conclusion: Few factors have been studied for their impact on the effectiveness of AI-CDS. Due to conflicting outcomes between studies, we recommend that future work leverage the concept of "appropriate trust" to facilitate more robust research on AI-CDS, aiming not to increase overall trust in or acceptance of AI but to ensure that clinicians accept AI recommendations only when trust in AI is warranted.
"Factors influencing the effectiveness of artificial intelligence-assisted decision-making in medicine: a scoping review." Journal of the American Medical Informatics Association, 2026-01-28. DOI: 10.1093/jamia/ocag002.
Mahanazuddin Syed, Muayad Hamidi, Manju Bikkanuri, Nicole Adele Dierschke, Haritha Vardhini Katragadda, Meredith Zozus, Antonio Lucio Teixeira
Objectives: To evaluate the performance of a locally deployed adaptation of TrialGPT, a large language model (LLM) system for identifying trial-eligible patients from unstructured electronic health record (EHR) data.
Materials and methods: TrialGPT was re-engineered for secure deployment at UT Health San Antonio using a locally hosted LLM. It was optimized for real-world data needs through a longitudinal patient-encounter-note hierarchy mirroring EHR documentation. Performance was evaluated in two stages: (1) benchmarking against an expert-adjudicated gold corpus (n = 149) and (2) comparative validation against manual screening (n = 55).
Results: Against the expert-adjudicated corpus, the system achieved 81.8% sensitivity, 97.8% specificity, and a positive predictive value of 75.0%. Compared with manual screening, it identified more than twice as many truly eligible patients (81.8% vs 36.4%) while preserving equivalent specificity.
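The reported screening rates follow directly from a confusion matrix. The counts below are hypothetical (the abstract reports only the derived percentages), but they are one set consistent with the 81.8%/97.8%/75.0% figures and the n = 149 corpus:

```python
# Illustrative confusion-matrix arithmetic behind the reported screening metrics.
# The counts are assumed for illustration; the abstract gives only the rates.
def screening_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, and positive predictive value."""
    sensitivity = tp / (tp + fn)   # recall among truly eligible patients
    specificity = tn / (tn + fp)   # correct rejections among ineligible patients
    ppv = tp / (tp + fp)           # precision of "eligible" calls
    return sensitivity, specificity, ppv

# Hypothetical counts summing to the n = 149 gold corpus
sens, spec, ppv = screening_metrics(tp=9, fp=3, tn=135, fn=2)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} PPV={ppv:.1%}")
# prints: sensitivity=81.8% specificity=97.8% PPV=75.0%
```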
Conclusion: The adapted TrialGPT framework operationalizes trial matching, translating EHR data into actionable screening intelligence for efficient, scalable clinical trial recruitment.
"Translating evidence into practice: adapting TrialGPT for real-world clinical trial eligibility screening." Journal of the American Medical Informatics Association, 2026-01-27. DOI: 10.1093/jamia/ocag006.
Huixue Zhou, Lisa Chow, Lisa Harnack, Satchidananda Panda, Emily N C Manoogian, Mingchen Li, Yongkang Xiao, Rui Zhang
Objectives: This study explores the use of advanced natural language processing (NLP) techniques to enhance food classification and dietary analysis using raw text input from a diet tracking app.
Materials and methods: The study was conducted in 3 stages: data collection, framework development, and application. Data were collected from a 12-week randomized controlled trial (RCT: NCT04259632), in which participants recorded their meals in free-text format using the myCircadianClock app. Only de-identified data were used. We developed nutrition-focused retrieval-augmented generation (NutriRAG), an NLP framework that uses a retrieval-augmented generation approach to enhance food classification from free-text inputs. The framework retrieves relevant examples from a curated database and then leverages large language models, such as GPT-4, to classify user-recorded food items into predefined categories without fine-tuning. NutriRAG was then applied to data from the RCT, which included 77 adults with obesity recruited from the Twin Cities metro area and randomized into 3 intervention groups: time-restricted eating (TRE, 8-hour eating window), caloric restriction (CR, 15% reduction), and unrestricted eating.
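The retrieve-then-classify pattern described above can be sketched in a few lines. Everything here is illustrative: NutriRAG retrieves from a curated nutrition database and calls GPT-4, whereas this toy uses word-overlap retrieval and stops at assembling the few-shot prompt:

```python
# Minimal sketch of retrieval-augmented classification (no fine-tuning).
# Database entries, categories, and the overlap scorer are illustrative only.
def retrieve_examples(query, database, k=3):
    """Rank curated (food_text, category) examples by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(database, key=lambda ex: -len(q & set(ex[0].lower().split())))
    return scored[:k]

def build_prompt(query, examples):
    """Assemble a few-shot classification prompt for the LLM."""
    shots = "\n".join(f"Item: {t}\nCategory: {c}" for t, c in examples)
    return f"{shots}\nItem: {query}\nCategory:"

database = [
    ("chocolate chip cookie", "sugary food"),
    ("grilled chicken breast", "protein"),
    ("oatmeal cookie with raisins", "sugary food"),
    ("black coffee", "beverage"),
]
examples = retrieve_examples("homemade oatmeal cookie", database, k=2)
print(build_prompt("homemade oatmeal cookie", examples))
```

In the real system, the assembled prompt would be sent to the LLM, whose completion supplies the predicted category.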
Results: NutriRAG significantly enhanced classification accuracy and helped to analyze dietary habits, as noted by the retrieval-augmented GPT-4 model achieving a micro-F1 score of 82.24. Both interventions showed dietary alterations: CR participants ate fewer snacks and sugary foods, while TRE participants reduced nighttime eating.
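Micro-F1, the metric reported above, pools true positives, false positives, and false negatives over all (item, label) pairs before computing F1. A minimal sketch with hypothetical labels (the study's actual categories and data are not shown here):

```python
# Micro-averaged F1 for multi-label classification; labels are illustrative.
def micro_f1(gold, pred):
    """gold and pred are parallel lists of label sets, one per food item."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # labels missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations for three logged food items
gold = [{"snack"}, {"sugary", "beverage"}, {"protein"}]
pred = [{"snack"}, {"sugary"}, {"protein", "snack"}]
print(f"micro-F1 = {micro_f1(gold, pred):.2f}")  # prints: micro-F1 = 0.75
```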
Conclusion: By using artificial intelligence, NutriRAG marks a substantial advancement in food classification and dietary analysis of nutritional assessments. The findings highlight NLP's potential to personalize nutrition and manage diet-related health issues, suggesting further research to expand these models for wider use.
"NutriRAG: unleashing the power of large language models for food identification and classification through retrieval methods." Journal of the American Medical Informatics Association, 2026-01-23. DOI: 10.1093/jamia/ocag003. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13005737/pdf/
Kathryn R Schneider, Hillary E Swann-Thomsen, Terry G Ribbens, Lucas A Bahnmaier, Trevor Satterfield, Reme Pullicar, Neeraj Soni
Background and significance: Physician and advanced practice provider (APP) well-being is a critical focus in healthcare. Emerging technology such as generative artificial intelligence (GAI) scribes reduces physician and APP administrative burden created by electronic health records. Early adopters of this technology have demonstrated promising improvements in clinical documentation, well-being, and cognitive load. However, further exploration across professional roles is warranted.
Objective: The goal of this quality improvement initiative was to explore how GAI scribes impacted well-being, cognitive load, and practice efficiency among physicians and APPs across professional roles.
Methods: A cross-sectional anonymous survey was conducted prior to implementation of GAI scribe technology and 3 months after physicians and APPs were onboarded.
Results: Physicians and APPs showed a reduction in cognitive task load following scribe technology implementation. Physicians reported reduced burnout and intent to leave; however, APPs did not have a significant reduction in burnout or intent to leave.
Conclusion: Artificial intelligence scribe technology shows potential for improving well-being among physicians and APPs by reducing cognitive load and clinical documentation time. Although some differences were found, overall, the technology appears to hold promise across professional roles.
"The impact of artificial intelligence scribes on physician and advanced practice provider cognitive load and well-being." Journal of the American Medical Informatics Association, 2026-01-21. DOI: 10.1093/jamia/ocag005.
Raja Mazumder, Jonathon Keeney, Luke Johnson, Lori Krammer, Patrick McNeely, Jorge Sepulveda, Danielle Hangen, Maria Martin, Dushyanth Jyothi, Jonas De Almeida, Peter McGarvey, Adil Alaoui, Sarah Cha, Art Sedrakyan, Evan Shoelle, Michael Matheny, Michele LeNoue-Newton, Robert Winter, Stephen Deppen, Vahan Simonyan, Anelia Horvath
Objectives: Federated Ecosystems for Analytics and Standardized Technologies (FEAST) is a modular, cloud-based platform developed through the ARPA-H Biomedical Data Fabric initiative to enable secure, federated analysis of real-world biomedical data. To guide and iteratively refine its modular design, the FEAST team conducted a cross-institutional survey to systematically identify and prioritize research needs related to authorized-access data across diverse biomedical domains. This study presents a structured synthesis of submitted use cases to uncover infrastructure gaps, data integration challenges, and translational opportunities. The results from the survey inform both front-end user-facing functionality and backend data requirements, shaping how the interface supports user interactions, data types, and compliance with security and interoperability standards.
Materials and methods: A structured survey form was distributed to researchers affiliated with participating institutions, including DNA-HIVE, The George Washington University (GW-FEAST), Weill Cornell Medicine, Vanderbilt University Medical Center, Georgetown University, European Bioinformatics Institute, and Kaiser Permanente. Respondents completed standardized fields describing the data types of interest, project goals, analytic methods, and perceived technical barriers. The collected responses were curated and analyzed to identify common needs related to privacy, interoperability, scalability, and workflow reproducibility.
Results: The survey compiled 61 use cases spanning genomics, imaging, clinical phenotyping, EHR-driven analytics, and precision medicine. Common themes included the need for multi-modal data integration, HL7 FHIR-based secure access, federated model training without PII retention, and containerized microservices for scalable deployment. Convergent needs across institutions emphasized consistent demand for FAIR-compliant infrastructure and readiness for real-world data analytics.
Conclusion: The FEAST Use Cases survey provides a cross-sectional view of biomedical informatics priorities grounded in real-world data needs. The findings offer a strategic blueprint for developing federated, privacy-preserving infrastructure to support secure, collaborative, and scalable biomedical research.
"From use cases to infrastructure: a cross-institutional survey of priorities in data-driven biomedical research." Journal of the American Medical Informatics Association, 2026-01-20. DOI: 10.1093/jamia/ocag001.
Purpose: To highlight the importance of reporting negative results in large language model (LLM) research, particularly as these systems are increasingly integrated into healthcare.
Potential: LLMs offer transformative capabilities in text generation, summarization, and clinical decision support. Transparent documentation of both successes and failures can accelerate innovation, improve reproducibility, and guide safe deployment.
Caution: Publication bias toward positive findings conceals model limitations, biases, and reproducibility challenges. In healthcare, underreporting failures risks patient safety, ethical lapses, and wasted resources. Structural barriers, including a lack of standards and limited funding for failure analysis, perpetuate this cycle.
Conclusions: Negative results should be recognized as valuable contributions that delineate the boundaries of LLM applicability. Structured reporting, educational initiatives, and stronger incentives for transparency are essential to ensure responsible, equitable, and trustworthy use of LLMs in healthcare.
Satvik Tripathi, Dana Alkhulaifat, Tessa S Cook. "Positive act of reporting negative results in large language model research: a call for transparency." Journal of the American Medical Informatics Association, 2026-01-19. DOI: 10.1093/jamia/ocaf221.
Aparajita Kashyap, Christopher J Allsman, Elizabeth A Campbell, Pooja M Desai, Salvatore G Volpe, Bria P Massey, Tiffani J Bright, Suzanne Bakken, Oliver J Bear Don't Walk Iv, Adrienne Pichon
Objectives: Advancing health through informatics requires attending to justice. Recent policy changes in the United States have introduced significant barriers to promoting justice within informatics due to targeted funding cuts and hostility to science, especially science that prioritizes justice.
Materials and methods: We present five key principles for advancing a justice-oriented informatics agenda, synthesized from our workshop held at the American Medical Informatics Association 2022 Annual Symposium.
Results: These principles are: (1) Recognize knowledge and methodologies across communities; (2) Acknowledge historical and cultural contexts of interactions; (3) Facilitate transparency and accountability through clear measures and metrics; (4) Foster trust and sustainability; and (5) Equitably allocate compensation and resources.
Discussion and conclusion: We discuss barriers to implementing these principles that have arisen since the 2022 workshop and provide recommendations for moving towards justice-oriented informatics. We offer examples of how these principles may be used to frame challenges and adapt to new barriers within biomedical informatics.
"Contextualizing key principles to promote a justice-oriented informatics research agenda: proceedings and reflections from an American Medical Informatics Association workshop." Journal of the American Medical Informatics Association, 2026-01-19. DOI: 10.1093/jamia/ocaf210.
Adrien Osakwe, Noah Wightman, Marc W Deyell, Zachary Laksman, Alvin Shrier, Gil Bub, Leon Glass, Thomas M Bury
Objective: Frequent premature ventricular complexes (PVCs) can lead to adverse health conditions such as cardiomyopathy. The linear correlation between PVC frequency and heart rate (as positive, negative, or neutral) on a 24-hour Holter recording has been proposed as a way to classify patients and guide treatment with beta-blockers. Our objective was to evaluate the robustness of this classification to measurement methodology, different 24-hour periods, and nonlinear dependencies of PVCs on heart rate.
Materials and methods: We analyzed 82 multi-day Holter recordings (1-7 days) collected from 48 patients with frequent PVCs (burden 1%-44%). For each record, linear correlation between PVC frequency and heart rate was computed for different 24-hour periods and using different length intervals to determine PVC frequency.
Results: Using a 1-hour interval, the correlation between PVC frequency and heart rate was consistently positive, negative, or neutral on different days in only 36.6% of patients. Using shorter time intervals, the correlation was consistent in 56.1% of patients. Shorter time intervals revealed nonlinear and piecewise linear relationships between PVC frequency and heart rate in many patients.
Discussion: The variability of the correlation between PVC frequency and heart rate across different 24-hour periods and interval durations suggests that the relationship is neither strictly linear nor stationary. A better understanding of the mechanism driving the PVCs, combined with computational and biological models that represent these mechanisms, may provide insight into the observed nonlinear behavior and guide more robust classification strategies.
Conclusion: Linear correlation as a tool to classify patients with frequent PVCs should be used with caution. It is sensitive to the specific 24-hour period analyzed and the methodology used to segment the data. More sophisticated classification approaches that can capture nonlinear and time-varying dependencies should be developed and considered in clinical practice.
"Dependence of premature ventricular complexes on heart rate-it's not that simple." Journal of the American Medical Informatics Association, pages 90-97, published 2026-01-01. doi:10.1093/jamia/ocaf069.
Objectives: To improve prediction of chronic kidney disease (CKD) progression to end-stage renal disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to integrated clinical and claims data with varying observation windows, supported by explainable artificial intelligence (AI) to enhance interpretability and reduce bias.
Materials and methods: We utilized data from 10 326 CKD patients, combining clinical and claims information from 2009 to 2018. After preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML and DL models using 5 distinct observation windows. Feature importance and SHapley Additive exPlanations (SHAP) analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification patterns, and bias.
Results: Integrated data models outperformed single data source models, with long short-term memory achieving the highest area under the receiver operating characteristic curve (AUROC) (0.93) and F1 score (0.65). A 24-month observation window optimally balanced early detection and prediction accuracy. The 2021 estimated glomerular filtration rate (eGFR) equation improved prediction accuracy and reduced racial bias, particularly for African American patients.
Discussion: Improved prediction accuracy, interpretability, and bias mitigation strategies have the potential to enhance CKD management, support targeted interventions, and reduce health-care disparities.
Conclusion: This study presents a robust framework for predicting ESRD outcomes, improving clinical decision-making through integrated multisourced data and advanced analytics. Future research will expand data integration and extend this framework to other chronic diseases.
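The headline metrics reported above (AUROC 0.93, F1 0.65) can be computed from first principles. The sketch below, with hypothetical toy labels and scores, shows both: AUROC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, and F1 as the harmonic mean of precision and recall.

```python
def auroc(y_true, scores):
    """AUROC via the rank statistic: fraction of positive/negative
    pairs in which the positive case receives the higher score
    (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical model outputs for four patients (1 = progressed to ESRD).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
auc = auroc(y_true, scores)                       # 3 of 4 pairs ranked correctly
f1_score = f1(y_true, [int(s >= 0.5) for s in scores])
```

In practice a library implementation (e.g. scikit-learn's `roc_auc_score` and `f1_score`) would be used; the point here is only what the two numbers measure.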
"Enhancing end-stage renal disease outcome prediction: a multisourced data-driven approach." Yubo Li, Rema Padman. Journal of the American Medical Informatics Association, pages 26-36, published 2026-01-01. doi:10.1093/jamia/ocaf118.
Bernardo Consoli, Haoyang Wang, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding
Objective: Extracting social determinants of health (SDoHs) from medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. Here, we introduce SDoH-GPT, a novel framework leveraging few-shot learning large language models (LLMs) to automate the extraction of SDoH from unstructured text, aiming to improve both efficiency and generalizability.
Materials and methods: SDoH-GPT is a framework that combines few-shot learning LLMs, which extract SDoH from medical notes, with XGBoost classifiers trained on the annotations those LLMs generate. This combination exploits the strength of LLMs as few-shot learners and the efficiency of XGBoost once sufficient training data are available. SDoH-GPT can therefore extract SDoH without relying on extensive medical annotation or costly human intervention.
Results: Our approach achieved tenfold and twentyfold reductions in time and cost, respectively, and superior consistency with human annotators measured by Cohen's kappa of up to 0.92. The innovative combination of LLM and XGBoost can ensure high accuracy and computational efficiency while consistently maintaining 0.90+ AUROC scores.
Discussion: This study has verified SDoH-GPT on three datasets and highlights the potential of leveraging LLM and XGBoost to revolutionize medical note classification, demonstrating its capability to achieve highly accurate classifications with significantly reduced time and cost.
Conclusion: The key contribution of this study is the integration of LLMs with XGBoost, which enables cost-effective, high-quality annotation of SDoH. This research sets the stage for SDoH extraction to become more accessible, scalable, and impactful in driving future healthcare solutions.
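The agreement statistic the abstract reports (Cohen's kappa up to 0.92 between LLM annotations and human annotators) corrects raw agreement for agreement expected by chance. A minimal sketch, with hypothetical binary labels standing in for the study's SDoH annotation schema:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected if both raters labeled
    at random with their own marginal frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 1 = note mentions an SDoH factor, 0 = it does not.
human = [1, 1, 0, 0, 1]
llm   = [1, 1, 0, 0, 0]
kappa = cohens_kappa(human, llm)   # observed 0.8, chance 0.48
```

A kappa near the paper's 0.92 indicates agreement far above chance, whereas raw percent agreement alone can look high even when one label dominates.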
"SDoH-GPT: using large language models to extract social determinants of health." Journal of the American Medical Informatics Association, pages 67-78, published 2026-01-01. doi:10.1093/jamia/ocaf094.