Kathryn R Schneider, Hillary E Swann-Thomsen, Terry G Ribbens, Lucas A Bahnmaier, Trevor Satterfield, Reme Pullicar, Neeraj Soni
Background and significance: Physician and advanced practice provider (APP) well-being is a critical focus in healthcare. Emerging technology such as generative artificial intelligence (GAI) scribes reduces physician and APP administrative burden created by electronic health records. Early adopters of this technology have demonstrated promising improvements in clinical documentation, well-being, and cognitive load. However, further exploration across professional roles is warranted.
Objective: The goal of this quality improvement initiative was to explore how GAI scribes impacted well-being, cognitive load, and practice efficiency among physicians and APPs across professional roles.
Methods: A cross-sectional anonymous survey was conducted prior to implementation of GAI scribe technology and 3 months after physicians and APPs were onboarded.
Results: Physicians and APPs showed a reduction in cognitive task load following scribe technology implementation. Physicians reported reduced burnout and intent to leave; however, APPs did not have a significant reduction in burnout or intent to leave.
Conclusion: Artificial intelligence scribe technology shows potential for improving well-being among physicians and APPs by reducing cognitive load and clinical documentation time. Although some differences were found, overall, the technology appears to hold promise across professional roles.
{"title":"The impact of artificial intelligence scribes on physician and advanced practice provider cognitive load and well-being.","authors":"Kathryn R Schneider, Hillary E Swann-Thomsen, Terry G Ribbens, Lucas A Bahnmaier, Trevor Satterfield, Reme Pullicar, Neeraj Soni","doi":"10.1093/jamia/ocag005","DOIUrl":"https://doi.org/10.1093/jamia/ocag005","url":null,"abstract":"<p><strong>Background and significance: </strong>Physician and advanced practice provider (APP) well-being is a critical focus in healthcare. Emerging technology such as generative artificial intelligence (GAI) scribes reduces physician and APP administrative burden created by electronic health records. Early adopters of this technology have demonstrated promising improvements in clinical documentation, well-being, and cognitive load. However, further exploration across professional roles is warranted.</p><p><strong>Objective: </strong>The goal of this quality improvement initiative was to explore how GAI scribes impacted well-being, cognitive load, and practice efficiency among physicians and APPs across professional roles.</p><p><strong>Methods: </strong>A cross-sectional anonymous survey was conducted prior to implementation of GAI scribe technology and 3 months after physicians and APPs were onboarded.</p><p><strong>Results: </strong>Physicians and APPs showed a reduction in cognitive task load following scribe technology implementation. Physicians reported reduced burnout and intent to leave; however, APPs did not have a significant reduction in burnout or intent to leave.</p><p><strong>Conclusion: </strong>Artificial intelligence scribe technology shows potential for improving well-being among physicians and APPs by reducing cognitive load and clinical documentation time. Although some differences were found, overall, the technology appears to hold promise across professional roles.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146202888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Raja Mazumder, Jonathon Keeney, Luke Johnson, Lori Krammer, Patrick McNeely, Jorge Sepulveda, Danielle Hangen, Maria Martin, Dushyanth Jyothi, Jonas De Almeida, Peter McGarvey, Adil Alaoui, Sarah Cha, Art Sedrakyan, Evan Shoelle, Michael Matheny, Michele LeNoue-Newton, Robert Winter, Stephen Deppen, Vahan Simonyan, Anelia Horvath
Objectives: Federated Ecosystems for Analytics and Standardized Technologies (FEAST) is a modular, cloud-based platform developed through the ARPA-H Biomedical Data Fabric initiative to enable secure, federated analysis of real-world biomedical data. To guide and iteratively refine its modular design, the FEAST team conducted a cross-institutional survey to systematically identify and prioritize research needs related to authorized-access data across diverse biomedical domains. This study presents a structured synthesis of submitted use cases to uncover infrastructure gaps, data integration challenges, and translational opportunities. The results from the survey inform both front-end user-facing functionality and backend data requirements, shaping how the interface supports user interactions, data types, and compliance with security and interoperability standards.
Materials and methods: A structured survey form was distributed to researchers affiliated with participating institutions, including DNA-HIVE, The George Washington University (GW-FEAST), Weill Cornell Medicine, Vanderbilt University Medical Center, Georgetown University, European Bioinformatics Institute, and Kaiser Permanente. Respondents completed standardized fields describing the data types of interest, project goals, analytic methods, and perceived technical barriers. The collected responses were curated and analyzed to identify common needs related to privacy, interoperability, scalability, and workflow reproducibility.
Results: The survey compiled 61 use cases spanning genomics, imaging, clinical phenotyping, EHR-driven analytics, and precision medicine. Common themes included the need for multi-modal data integration, HL7 FHIR-based secure access, federated model training without PII retention, and containerized microservices for scalable deployment. Convergent needs across institutions emphasized consistent demand for FAIR-compliant infrastructure and readiness for real-world data analytics.
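To make the HL7 FHIR theme above concrete, the sketch below shows what authorized, token-gated retrieval of FHIR Observation resources could look like; the server URL, token handling, and LOINC code are illustrative assumptions, not part of the FEAST platform described here.

```python
# Minimal sketch of HL7 FHIR-based secure access (hypothetical endpoint and token).
# FHIR exposes resources such as Patient and Observation over a REST API;
# an OAuth2 bearer token gates access to authorized users only.
import requests

FHIR_BASE = "https://fhir.example.org/R4"   # hypothetical FHIR server base URL
TOKEN = "..."                                # obtained via the site's OAuth2 flow

def fetch_observations(patient_id: str, loinc_code: str) -> list[dict]:
    """Return Observation resources for one patient and one LOINC code."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": f"http://loinc.org|{loinc_code}"},
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]
```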
Conclusion: The FEAST Use Cases survey provides a cross-sectional view of biomedical informatics priorities grounded in real-world data needs. The findings offer a strategic blueprint for developing federated, privacy-preserving infrastructure to support secure, collaborative, and scalable biomedical research.
{"title":"From use cases to infrastructure: a cross-institutional survey of priorities in data-driven biomedical research.","authors":"Raja Mazumder, Jonathon Keeney, Luke Johnson, Lori Krammer, Patrick McNeely, Jorge Sepulveda, Danielle Hangen, Maria Martin, Dushyanth Jyothi, Jonas De Almeida, Peter McGarvey, Adil Alaoui, Sarah Cha, Art Sedrakyan, Evan Shoelle, Michael Matheny, Michele LeNoue-Newton, Robert Winter, Stephen Deppen, Vahan Simonyan, Anelia Horvath","doi":"10.1093/jamia/ocag001","DOIUrl":"https://doi.org/10.1093/jamia/ocag001","url":null,"abstract":"<p><strong>Objectives: </strong>Federated Ecosystems for Analytics and Standardized Technologies (FEAST) is a modular, cloud-based platform developed through the ARPA-H Biomedical Data Fabric initiative to enable secure, federated analysis of real-world biomedical data. To guide and iteratively refine its modular design, the FEAST team conducted a cross-institutional survey to systematically identify and prioritize research needs related to authorized-access data across diverse biomedical domains. This study presents a structured synthesis of submitted use cases to uncover infrastructure gaps, data integration challenges, and translational opportunities. The results from the survey inform both front-end user-facing functionality and backend data requirements, shaping how the interface supports user interactions, data types, and compliance with security and interoperability standards.</p><p><strong>Materials and methods: </strong>A structured survey form was distributed to researchers affiliated with participating institutions, including DNA-HIVE, The George Washington University (GW-FEAST), Weill Cornell Medicine, Vanderbilt University Medical Center, Georgetown University, European Bioinformatics Institute, and Kaiser Permanente. Respondents completed standardized fields describing the data types of interest, project goals, analytic methods, and perceived technical barriers. The collected responses were curated and analyzed to identify common needs related to privacy, interoperability, scalability, and workflow reproducibility.</p><p><strong>Results: </strong>The survey compiled 61 use cases spanning genomics, imaging, clinical phenotyping, EHR-driven analytics, and precision medicine. Common themes included the need for multi-modal data integration, HL7 FHIR-based secure access, federated model training without PII retention, and containerized microservices for scalable deployment. Convergent needs across institutions emphasized consistent demand for FAIR-compliant infrastructure and readiness for real-world data analytics.</p><p><strong>Conclusion: </strong>The FEAST Use Cases survey provides a cross-sectional view of biomedical informatics priorities grounded in real-world data needs. 
The findings offer a strategic blueprint for developing federated, privacy-preserving infrastructure to support secure, collaborative, and scalable biomedical research.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Purpose: To highlight the importance of reporting negative results in large language model (LLM) research, particularly as these systems are increasingly integrated into healthcare.
Potential: LLMs offer transformative capabilities in text generation, summarization, and clinical decision support. Transparent documentation of both successes and failures can accelerate innovation, improve reproducibility, and guide safe deployment.
Caution: Publication bias toward positive findings conceals model limitations, biases, and reproducibility challenges. In healthcare, underreporting failures risks patient safety, ethical lapses, and wasted resources. Structural barriers, including a lack of standards and limited funding for failure analysis, perpetuate this cycle.
Conclusions: Negative results should be recognized as valuable contributions that delineate the boundaries of LLM applicability. Structured reporting, educational initiatives, and stronger incentives for transparency are essential to ensure responsible, equitable, and trustworthy use of LLMs in healthcare.
{"title":"Positive act of reporting negative results in large language model research: a call for transparency.","authors":"Satvik Tripathi, Dana Alkhulaifat, Tessa S Cook","doi":"10.1093/jamia/ocaf221","DOIUrl":"https://doi.org/10.1093/jamia/ocaf221","url":null,"abstract":"<p><strong>Purpose: </strong>To highlight the importance of reporting negative results in large language model (LLM) research, particularly as these systems are increasingly integrated into healthcare.</p><p><strong>Potential: </strong>LLMs offer transformative capabilities in text generation, summarization, and clinical decision support. Transparent documentation of both successes and failures can accelerate innovation, improve reproducibility, and guide safe deployment.</p><p><strong>Caution: </strong>Publication bias toward positive findings conceals model limitations, biases, and reproducibility challenges. In healthcare, underreporting failures risks patient safety, ethical lapses, and wasted resources. Structural barriers, including a lack of standards and limited funding for failure analysis, perpetuate this cycle.</p><p><strong>Conclusions: </strong>Negative results should be recognized as valuable contributions that delineate the boundaries of LLM applicability. Structured reporting, educational initiatives, and stronger incentives for transparency are essential to ensure responsible, equitable, and trustworthy use of LLMs in healthcare.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145999585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aparajita Kashyap, Christopher J Allsman, Elizabeth A Campbell, Pooja M Desai, Salvatore G Volpe, Bria P Massey, Tiffani J Bright, Suzanne Bakken, Oliver J Bear Don't Walk Iv, Adrienne Pichon
Objectives: Advancing health through informatics requires attending to justice. Recent policy changes in the United States have introduced significant barriers to promoting justice within informatics due to targeted funding cuts and hostility to science, especially science that prioritizes justice.
Materials and methods: We present five key principles for advancing a justice-oriented informatics agenda, synthesized from our workshop held at the American Medical Informatics Association 2022 Annual Symposium.
Results: These principles are: (1) Recognize knowledge and methodologies across communities; (2) Acknowledge historical and cultural contexts of interactions; (3) Facilitate transparency and accountability through clear measures and metrics; (4) Foster trust and sustainability; and (5) Equitably allocate compensation and resources.
Discussion and conclusion: We discuss barriers to implementing these principles that have arisen since the 2022 workshop and provide recommendations for moving towards justice-oriented informatics. We offer examples of how these principles may be used to frame challenges and adapt to new barriers within biomedical informatics.
{"title":"Contextualizing key principles to promote a justice-oriented informatics research agenda: proceedings and reflections from an American Medical Informatics Association workshop.","authors":"Aparajita Kashyap, Christopher J Allsman, Elizabeth A Campbell, Pooja M Desai, Salvatore G Volpe, Bria P Massey, Tiffani J Bright, Suzanne Bakken, Oliver J Bear Don't Walk Iv, Adrienne Pichon","doi":"10.1093/jamia/ocaf210","DOIUrl":"https://doi.org/10.1093/jamia/ocaf210","url":null,"abstract":"<p><strong>Objectives: </strong>Advancing health through informatics requires attending to justice. Recent policy changes in the United States have introduced significant barriers to promoting justice within informatics due to targeted funding cuts and hostility to science, especially science that prioritizes justice.</p><p><strong>Materials and methods: </strong>We present five key principles for advancing a justice-oriented informatics agenda, synthesized from our workshop held at the American Medical Informatics Association 2022 Annual Symposium.</p><p><strong>Results: </strong>These principles are: (1) Recognize knowledge and methodologies across communities; (2) Acknowledge historical and cultural contexts of interactions; (3) Facilitate transparency and accountability through clear measures and metrics; (4) Foster trust and sustainability; and (5) Equitably allocate compensation and resources.</p><p><strong>Discussion and conclusion: </strong>We discuss barriers to implementing these principles that have arisen since the 2022 workshop and provide recommendations for moving towards justice-oriented informatics. We offer examples of how these principles may be used to frame challenges and adapt to new barriers within BMI.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145999600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.
Methods: We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.
Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
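As a minimal illustration of the evaluation gap noted above, the following sketch (with entirely hypothetical scores) checks how well an automated metric such as ROUGE-L tracks human ratings of the same model outputs using a Spearman rank correlation.

```python
# Minimal sketch of checking agreement between an automated NLP metric and
# human ratings on the same model outputs (all values are hypothetical).
from scipy.stats import spearmanr

# One automated score (eg, ROUGE-L) and one human quality rating per output.
rouge_l_scores = [0.41, 0.28, 0.55, 0.33, 0.62, 0.47, 0.30, 0.58]
human_ratings  = [4,    2,    3,    4,    5,    2,    3,    4]   # 1-5 Likert scale

rho, p_value = spearmanr(rouge_l_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (P = {p_value:.3f})")
# A weak rho would mirror the review's finding that scalable NLP metrics
# do not reliably track gold-standard human evaluations.
```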
Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
{"title":"Testing and evaluation of generative large language models in electronic health record applications: a systematic review.","authors":"Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou","doi":"10.1093/jamia/ocaf233","DOIUrl":"https://doi.org/10.1093/jamia/ocaf233","url":null,"abstract":"<p><strong>Background: </strong>The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.</p><p><strong>Methods: </strong>We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.</p><p><strong>Results: </strong>Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.</p><p><strong>Conclusion: </strong>Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson
Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.
Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.
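A minimal sketch of the model class described here, assuming hypothetical column names for a patient-day file: nightshift EHR time regressed on the preceding dayshift team's handoff and EHR-time terciles, with a random intercept for the daytime physician. This is an illustration of the specification, not the authors' analysis code.

```python
# Minimal sketch of a mixed effects regression with physician random effects
# (column names are hypothetical; the actual model is the authors').
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("patient_day_ehr_metadata.csv")  # hypothetical patient-day file

# Nightshift EHR minutes modeled on dayshift handoff/EHR-time terciles,
# with a random intercept for the daytime physician.
model = smf.mixedlm(
    "night_ehr_minutes ~ C(day_handoff_tercile) + C(day_ehr_time_tercile)",
    data=df,
    groups=df["day_physician_id"],
)
result = model.fit()
print(result.summary())
```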
Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).
Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.
Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.
{"title":"Digital interdependence: impact of work spillover during clinical team handoffs.","authors":"Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson","doi":"10.1093/jamia/ocaf212","DOIUrl":"https://doi.org/10.1093/jamia/ocaf212","url":null,"abstract":"<p><strong>Objective: </strong>To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.</p><p><strong>Materials and methods: </strong>Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.</p><p><strong>Results: </strong>We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated. (ie, Busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).</p><p><strong>Discussion: </strong>We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.</p><p><strong>Conclusion: </strong>Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah
Objective: Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate the accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.
Materials and methods: We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.
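A minimal sketch of the underlying power calculation, under the simplifying assumption of a two-group comparison of carrier frequencies with imperfect case labels; the frequencies, positive predictive value, sample sizes, and alpha below are illustrative values, not the study's.

```python
# Minimal sketch of how outcome misclassification attenuates power for a
# case-control comparison of carrier frequencies (all numbers hypothetical).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_carrier_true_case = 0.10   # carrier frequency among true cases
p_carrier_control   = 0.01   # carrier frequency among true non-cases
ppv = 0.80                   # share of EHR-labeled cases that are true cases
n_cases, n_controls = 500, 5000

# Labeled "cases" are a mixture of true cases and misclassified non-cases,
# so the observed carrier frequency among labeled cases is pulled toward controls.
p_carrier_labeled_case = ppv * p_carrier_true_case + (1 - ppv) * p_carrier_control

analysis = NormalIndPower()
for label, p_case in [("perfect phenotype", p_carrier_true_case),
                      ("imperfect phenotype", p_carrier_labeled_case)]:
    es = proportion_effectsize(p_case, p_carrier_control)
    power = analysis.solve_power(effect_size=es, nobs1=n_cases,
                                 ratio=n_controls / n_cases,
                                 alpha=5e-8)  # genome-wide threshold, for illustration
    print(f"{label}: power = {power:.2f}")
```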
Results: We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.
Discussion: EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.
{"title":"Measuring the accuracy of electronic health record-based phenotyping in the All of Us Research Program to optimize statistical power for genetic association testing.","authors":"John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah","doi":"10.1093/jamia/ocaf234","DOIUrl":"https://doi.org/10.1093/jamia/ocaf234","url":null,"abstract":"<p><strong>Objective: </strong>Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.</p><p><strong>Materials and methods: </strong>We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.</p><p><strong>Results: </strong>We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.</p><p><strong>Discussion: </strong>EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas
Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.
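A minimal sketch of a product-of-coefficients mediation decomposition of this kind, using linear probability models for simplicity; the file and variable names are hypothetical, and this is not necessarily the estimation approach used in the study.

```python
# Minimal sketch of a product-of-coefficients mediation decomposition
# (linear probability models for simplicity; variable names hypothetical,
# not the authors' actual estimation approach).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nhis_latino_adults.csv")   # hypothetical analytic file
# non_english = 1 if non-English language preference; dhl = digital health
# literacy score; telehealth = 1 if any telehealth use in the past year.

total   = smf.ols("telehealth ~ non_english", data=df).fit()
m_model = smf.ols("dhl ~ non_english", data=df).fit()
y_model = smf.ols("telehealth ~ non_english + dhl", data=df).fit()

# Indirect (mediated) effect = effect of language preference on dHL times
# effect of dHL on telehealth use, holding language preference constant.
indirect = m_model.params["non_english"] * y_model.params["dhl"]
total_effect = total.params["non_english"]
print(f"proportion mediated = {indirect / total_effect:.2f}")
```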
{"title":"Digital health literacy as mediator between language preference and telehealth use among Latinos in the United States.","authors":"Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas","doi":"10.1093/jamia/ocaf232","DOIUrl":"https://doi.org/10.1093/jamia/ocaf232","url":null,"abstract":"<p><p>Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker
Objective: Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work exploring ML suppression (silencing predictions based on auditing the ML) shows promise in mitigating performance issues that originate from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and to evaluate ML uncertainty as a criterion for auditing the ML.
Materials and methods: We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.
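For reference, the fairness metric named here, the absolute averaged odds difference, is the mean of the absolute gaps in true-positive and false-positive rates between two groups. A minimal sketch with toy arrays follows; the data are hypothetical and only the metric definition is illustrated.

```python
# Minimal sketch of the absolute averaged odds difference between two groups:
# the mean of the absolute TPR gap and the absolute FPR gap (arrays are toy data).
import numpy as np

def rates(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Return (TPR, FPR) for binary labels and binary predictions."""
    tpr = np.mean(y_pred[y_true == 1])
    fpr = np.mean(y_pred[y_true == 0])
    return tpr, fpr

def abs_averaged_odds_difference(y_true, y_pred, group) -> float:
    tpr_a, fpr_a = rates(y_true[group == 0], y_pred[group == 0])
    tpr_b, fpr_b = rates(y_true[group == 1], y_pred[group == 1])
    return 0.5 * (abs(tpr_a - tpr_b) + abs(fpr_a - fpr_b))

# Toy example: outcome labels, thresholded predictions, and a binary group flag.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(abs_averaged_odds_difference(y_true, y_pred, group))
```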
Results: When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10⁻⁶) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10⁻⁴) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.
Conclusion: Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.
{"title":"Auditor models to suppress poor artificial intelligence predictions can improve human-artificial intelligence collaborative performance.","authors":"Katherine E Brown, Jesse O Wrenn, Nicholas J Jackson, Michael R Cauley, Benjamin X Collins, Laurie L Novak, Bradley A Malin, Jessica S Ancker","doi":"10.1093/jamia/ocaf235","DOIUrl":"10.1093/jamia/ocaf235","url":null,"abstract":"<p><strong>Objective: </strong>Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML has been known to have unfairness-inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness by overreliance. Recent work exploring ML suppression-silencing predictions based on auditing the ML-shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on collaboration fairness and evaluate ML uncertainty as desiderata to audit the ML.</p><p><strong>Materials and methods: </strong>We used data from the Vanderbilt University Medical Center electronic health record (n = 58 817) and the MIMIC-IV-ED dataset (n = 363 145) to predict likelihood of death or intensive care unit transfer and likelihood of 30-day readmission using gradient-boosted trees and an artificially high-performing oracle model. We derived clinician decisions directly from the dataset and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support alerts. We measured performance as area under the receiver operating characteristic curve and algorithmic fairness using absolute averaged odds difference.</p><p><strong>Results: </strong>When the ML outperforms humans, suppression outperforms the human alone (P < 8.2 × 10-6) and at least does not degrade fairness. When the human outperforms the ML, the human is either fairer than suppression (P < 8.2 × 10-4) or there is no statistically significant difference in fairness. Incorporating uncertainty quantification into suppression approaches can improve performance.</p><p><strong>Conclusion: </strong>Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.
Objective: We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.
Methods: We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.
Results: Studies often addressed multiple early-stage clinical tasks (question answering, 56.2%; knowledge structuring, 31.5%; disease prediction, 43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.
Conclusion: Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.
{"title":"Structural insights into clinical large language models and their barriers to translational readiness.","authors":"Jiwon You, Hangsik Shin","doi":"10.1093/jamia/ocaf230","DOIUrl":"https://doi.org/10.1093/jamia/ocaf230","url":null,"abstract":"<p><strong>Background: </strong>Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation.</p><p><strong>Objective: </strong>We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment.</p><p><strong>Methods: </strong>We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics.</p><p><strong>Results: </strong>Studies often addressed multiple early stage clinical tasks-question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%)-primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness.</p><p><strong>Conclusion: </strong>Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145949378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}