Santosh Purja Pun, Oliver Obst, Jim Basilakis, Jeewani Anupama Ginige
Objectives: Mapping between clinical classification systems, such as versions of the International Classification of Diseases (ICD), is essential yet challenging. Manual mapping is labor-intensive and does not scale, while existing embedding-based automatic mapping methods, particularly those leveraging transformer-based pretrained encoders, face 2 persistent challenges: (1) linguistic variation and (2) varying levels of granularity in clinical conditions.
Materials and methods: We introduce an automatic mapping method that combines the representational power of pretrained encoders with the reasoning capability of large language models (LLMs). For each ICD code, we generate (1) hierarchy-augmented (HA) and (2) LLM-generated (LG) descriptions to capture rich semantic nuances, addressing linguistic variation. We further introduce a prompting framework (PR) that leverages LLM reasoning to handle granularity mismatches, including source-to-parent mappings.
Results: Chapterwise mappings were performed between ICD versions (ICD-9-CM↔ICD-10-CM and ICD-10-AM↔ICD-11) using multiple LLMs. The proposed approach consistently outperformed the baseline across all ICD pairs and chapters. For example, combining HA descriptions with Qwen3-8B-generated descriptions yielded an average top-1 accuracy improvement of 6.5% (0.065) across the mapping cases. A small-scale pilot study further indicated that HA+LG remains effective in more challenging one-to-many mappings.
Conclusions: Our findings demonstrate that integrating the representational power of pretrained encoders with LLM reasoning offers a robust, scalable strategy for automatic ICD mapping.
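The retrieval step underlying embedding-based mapping methods like this one can be sketched as nearest-neighbor search over code-description embeddings. The function name, toy vectors, and the idea of precomputed embeddings below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def top1_map(source_vecs, target_vecs):
    """For each source-code embedding, return the index of the most
    similar target-code embedding by cosine similarity."""
    s = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    return (s @ t.T).argmax(axis=1)

# Toy embeddings: 2 source codes, 3 candidate target codes.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(top1_map(src, tgt))  # → [0 1]
```

In the paper's setting, each code would contribute several embeddings (original, HA, and LG descriptions), with the LLM prompting stage applied on top of the retrieved candidates.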
"On embedding-based automatic mapping of clinical classification system: handling linguistic variations and granular inconsistencies." Journal of the American Medical Informatics Association (2026-01-28). doi:10.1093/jamia/ocag004
Mahanazuddin Syed, Muayad Hamidi, Manju Bikkanuri, Nicole Adele Dierschke, Haritha Vardhini Katragadda, Meredith Zozus, Antonio Lucio Teixeira
Objectives: To evaluate the performance of a locally deployed adaptation of TrialGPT, a large language model (LLM) system for identifying trial-eligible patients from unstructured electronic health record (EHR) data.
Materials and methods: TrialGPT was re-engineered for secure deployment at UT Health San Antonio using a locally hosted LLM. It was optimized for real-world data needs through a longitudinal patient-encounter-note hierarchy mirroring EHR documentation. Performance was evaluated in two stages: (1) benchmarking against an expert-adjudicated gold corpus (n = 149) and (2) comparative validation against manual screening (n = 55).
Results: Against the expert-adjudicated corpus, the system achieved 81.8% sensitivity, 97.8% specificity, and a positive predictive value of 75.0%. Compared with manual screening, it identified more than twice as many truly eligible patients (81.8% vs 36.4%) while preserving equivalent specificity.
Conclusion: The adapted TrialGPT framework operationalizes trial matching, translating EHR data into actionable screening intelligence for efficient, scalable clinical trial recruitment.
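The three reported metrics follow directly from a screening confusion matrix. A minimal sketch, with hypothetical counts chosen only to illustrate the formulas (they roughly reproduce the reported rates but are not the study's data):

```python
def screening_metrics(tp, fp, tn, fn):
    """Confusion-matrix summary for an eligibility screener."""
    return {
        "sensitivity": tp / (tp + fn),  # share of eligible patients found
        "specificity": tn / (tn + fp),  # ineligible correctly excluded
        "ppv": tp / (tp + fp),          # precision among flagged patients
    }

# Hypothetical counts, for illustration only.
m = screening_metrics(tp=9, fp=3, tn=135, fn=2)
print(round(m["sensitivity"], 3), round(m["ppv"], 3))  # → 0.818 0.75
```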
"Translating evidence into practice: adapting TrialGPT for real-world clinical trial eligibility screening." Journal of the American Medical Informatics Association (2026-01-27). doi:10.1093/jamia/ocag006
Huixue Zhou, Lisa Chow, Lisa Harnack, Satchidananda Panda, Emily N C Manoogian, Mingchen Li, Yongkang Xiao, Rui Zhang
Objectives: This study explores the use of advanced natural language processing (NLP) techniques to enhance food classification and dietary analysis using raw text input from a diet tracking app.
Materials and methods: The study was conducted in 3 stages: data collection, framework development, and application. Data were collected from a 12-week randomized controlled trial (RCT: NCT04259632), in which participants recorded their meals in free-text format using the myCircadianClock app. Only de-identified data were used. We developed nutrition-focused retrieval-augmented generation (NutriRAG), an NLP framework that uses a retrieval-augmented generation approach to enhance food classification from free-text inputs. The framework retrieves relevant examples from a curated database and then leverages large language models, such as GPT-4, to classify user-recorded food items into predefined categories without fine-tuning. NutriRAG was then applied to data from the RCT, which included 77 adults with obesity recruited from the Twin Cities metro area and randomized into 3 intervention groups: time-restricted eating (TRE, 8-hour eating window), caloric restriction (CR, 15% reduction), and unrestricted eating.
Results: NutriRAG significantly enhanced classification accuracy and helped to analyze dietary habits, with the retrieval-augmented GPT-4 model achieving a micro-F1 score of 82.24. Both interventions showed dietary alterations: CR participants ate fewer snacks and sugary foods, while TRE participants reduced nighttime eating.
Conclusion: By using artificial intelligence, NutriRAG marks a substantial advance in food classification and dietary analysis for nutritional assessment. The findings highlight NLP's potential to personalize nutrition and manage diet-related health issues, suggesting further research to expand these models for wider use.
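The micro-F1 metric reported above (82.24 on a 0-100 scale) pools true positives, false positives, and false negatives across all food categories before computing precision and recall. A minimal sketch for multi-label food classification; the label sets are invented examples, not study data:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label predictions, where each item
    maps to a set of category labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"snack"}, {"fruit", "beverage"}]
pred = [{"snack"}, {"fruit"}]
print(round(micro_f1(gold, pred), 3))  # → 0.8
```

Micro-averaging weights each label instance equally, so frequent food categories dominate the score; macro-averaging would instead weight each category equally.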
"NutriRAG: unleashing the power of large language models for food identification and classification through retrieval methods." Journal of the American Medical Informatics Association (2026-01-23). doi:10.1093/jamia/ocag003
Raja Mazumder, Jonathon Keeney, Luke Johnson, Lori Krammer, Patrick McNeely, Jorge Sepulveda, Danielle Hangen, Maria Martin, Dushyanth Jyothi, Jonas De Almeida, Peter McGarvey, Adil Alaoui, Sarah Cha, Art Sedrakyan, Evan Shoelle, Michael Matheny, Michele LeNoue-Newton, Robert Winter, Stephen Deppen, Vahan Simonyan, Anelia Horvath
Objectives: Federated Ecosystems for Analytics and Standardized Technologies (FEAST) is a modular, cloud-based platform developed through the ARPA-H Biomedical Data Fabric initiative to enable secure, federated analysis of real-world biomedical data. To guide and iteratively refine its modular design, the FEAST team conducted a cross-institutional survey to systematically identify and prioritize research needs related to authorized-access data across diverse biomedical domains. This study presents a structured synthesis of submitted use cases to uncover infrastructure gaps, data integration challenges, and translational opportunities. The results from the survey inform both front-end user-facing functionality and backend data requirements, shaping how the interface supports user interactions, data types, and compliance with security and interoperability standards.
Materials and methods: A structured survey form was distributed to researchers affiliated with participating institutions, including DNA-HIVE, The George Washington University (GW-FEAST), Weill Cornell Medicine, Vanderbilt University Medical Center, Georgetown University, European Bioinformatics Institute, and Kaiser Permanente. Respondents completed standardized fields describing the data types of interest, project goals, analytic methods, and perceived technical barriers. The collected responses were curated and analyzed to identify common needs related to privacy, interoperability, scalability, and workflow reproducibility.
Results: The survey compiled 61 use cases spanning genomics, imaging, clinical phenotyping, EHR-driven analytics, and precision medicine. Common themes included the need for multi-modal data integration, HL7 FHIR-based secure access, federated model training without PII retention, and containerized microservices for scalable deployment. Convergent needs across institutions emphasized consistent demand for FAIR-compliant infrastructure and readiness for real-world data analytics.
Conclusion: The FEAST Use Cases survey provides a cross-sectional view of biomedical informatics priorities grounded in real-world data needs. The findings offer a strategic blueprint for developing federated, privacy-preserving infrastructure to support secure, collaborative, and scalable biomedical research.
"From use cases to infrastructure: a cross-institutional survey of priorities in data-driven biomedical research." Journal of the American Medical Informatics Association (2026-01-20). doi:10.1093/jamia/ocag001
Purpose: To highlight the importance of reporting negative results in large language model (LLM) research, particularly as these systems are increasingly integrated into healthcare.
Potential: LLMs offer transformative capabilities in text generation, summarization, and clinical decision support. Transparent documentation of both successes and failures can accelerate innovation, improve reproducibility, and guide safe deployment.
Caution: Publication bias toward positive findings conceals model limitations, biases, and reproducibility challenges. In healthcare, underreporting failures risks patient safety, ethical lapses, and wasted resources. Structural barriers, including a lack of standards and limited funding for failure analysis, perpetuate this cycle.
Conclusions: Negative results should be recognized as valuable contributions that delineate the boundaries of LLM applicability. Structured reporting, educational initiatives, and stronger incentives for transparency are essential to ensure responsible, equitable, and trustworthy use of LLMs in healthcare.
Satvik Tripathi, Dana Alkhulaifat, Tessa S Cook. "Positive act of reporting negative results in large language model research: a call for transparency." Journal of the American Medical Informatics Association (2026-01-19). doi:10.1093/jamia/ocaf221
Aparajita Kashyap, Christopher J Allsman, Elizabeth A Campbell, Pooja M Desai, Salvatore G Volpe, Bria P Massey, Tiffani J Bright, Suzanne Bakken, Oliver J Bear Don't Walk Iv, Adrienne Pichon
Objectives: Advancing health through informatics requires attending to justice. Recent policy changes in the United States have introduced significant barriers to promoting justice within informatics due to targeted funding cuts and hostility to science, especially science that prioritizes justice.
Materials and methods: We present five key principles for advancing a justice-oriented informatics agenda, synthesized from our workshop held at the American Medical Informatics Association 2022 Annual Symposium.
Results: These principles are: (1) Recognize knowledge and methodologies across communities; (2) Acknowledge historical and cultural contexts of interactions; (3) Facilitate transparency and accountability through clear measures and metrics; (4) Foster trust and sustainability; and (5) Equitably allocate compensation and resources.
Discussion and conclusion: We discuss barriers to implementing these principles that have arisen since the 2022 workshop and provide recommendations for moving towards justice-oriented informatics. We offer examples of how these principles may be used to frame challenges and adapt to new barriers within biomedical informatics.
"Contextualizing key principles to promote a justice-oriented informatics research agenda: proceedings and reflections from an American Medical Informatics Association workshop." Journal of the American Medical Informatics Association (2026-01-19). doi:10.1093/jamia/ocaf210
Xinsong Du, Zhengyang Zhou, Yifei Wang, Ya-Wen Chuang, Yiming Li, Richard Yang, Wenyu Zhang, Xinyi Wang, Xinyu Chen, Hao Guan, John Lian, Pengyu Hong, David W Bates, Li Zhou
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date.
Methods: We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods.
Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations.
Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
"Testing and evaluation of generative large language models in electronic health record applications: a systematic review." Journal of the American Medical Informatics Association (2026-01-13). doi:10.1093/jamia/ocaf233
Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson
Objective: To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.
Materials and methods: Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observations, we fit a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.
Results: We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time and overall EHR time), the nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated (ie, busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point higher relative risk of overnight decline (baseline prevalence of 4.4%) compared with the middle tercile (P = .04).
Discussion: We find evidence of spillovers in EHR work from dayshift to nightshift. Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.
Conclusion: Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.
{"title":"Digital interdependence: impact of work spillover during clinical team handoffs.","authors":"Dori A Cross, Josh Weiner, Hannah T Neprash, Genevieve B Melton, Andrew Olson","doi":"10.1093/jamia/ocaf212","DOIUrl":"https://doi.org/10.1093/jamia/ocaf212","url":null,"abstract":"<p><strong>Objective: </strong>To characterize the nature and consequence(s) of interdependent physician electronic health record (EHR) work across inpatient shifts.</p><p><strong>Materials and methods: </strong>Pooled cross-sectional analysis of EHR metadata associated with hospital medicine patients at an academic medical center, January-June 2022. Using patient-day observation data, we use a mixed effects regression model with daytime physician random effects to examine nightshift behavior (handoff time, total EHR time) as a function of behaviors by the preceding daytime team. We also assess whether nighttime patient deterioration is predicted by team coordination behaviors across shifts.</p><p><strong>Results: </strong>We observed 19 671 patient days (N = 2708 encounters). Physicians used the handoff tool consistently, generally spending 8-12 minutes per shift editing patient information. When the day service team was more activated (highest tercile of handoff time, overall EHR time), nightshift experienced increased levels of EHR work and patient risk of overnight decline was elevated. (ie, Busy predicts busy). However, lower levels of dayshift activation were also associated with nightshift spillovers, including higher overnight EHR work and increased likelihood of patient clinical decline. Patient-days in the lowest and highest terciles of dayshift EHR time had a 1 percentage point increased relative risk of overnight decline (baseline prevalence of 4.4%) compared to the middle tercile (P = .04).</p><p><strong>Discussion: </strong>We find evidence of spillovers in EHR work from dayshift to nightshift. 
Additionally, the lowest and highest levels of dayshift EHR activity are associated with increased risk of overnight patient decline. Results are associational and motivate further examination of additional confounding factors.</p><p><strong>Conclusion: </strong>Analyses reveal opportunities to address task interdependence across shifts, using technology to flexibly shape and support collaborative teaming practices in complex clinical environments.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
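The cross-shift regression described in the Cross et al. abstract can be sketched with statsmodels' MixedLM. This is a minimal illustration on simulated data: the variable names (night_ehr_min, day_handoff_min, physician_id) and all coefficients are assumptions for demonstration, not the study's actual schema or estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600  # simulated patient-day observations
df = pd.DataFrame({
    "physician_id": rng.integers(0, 30, n),            # daytime physician (random effect)
    "day_tercile": rng.choice(["low", "mid", "high"], n),
    "day_handoff_min": rng.gamma(4.0, 2.5, n),         # dayshift handoff editing time
})
df["day_low"] = (df["day_tercile"] == "low").astype(float)
df["day_high"] = (df["day_tercile"] == "high").astype(float)
# Simulate the U-shaped spillover the abstract reports: both the lowest and the
# highest dayshift-activation terciles raise nightshift EHR time.
df["night_ehr_min"] = (40 + 5 * df["day_low"] + 8 * df["day_high"]
                       + 0.5 * df["day_handoff_min"] + rng.normal(0, 5, n))

# Mixed effects regression of nightshift EHR time on dayshift behavior,
# with a random intercept for each daytime physician.
res = smf.mixedlm("night_ehr_min ~ day_low + day_high + day_handoff_min",
                  data=df, groups=df["physician_id"]).fit()
print(res.params[["day_low", "day_high", "day_handoff_min"]])
```

With both tercile dummies in the model, the middle tercile serves as the reference group, matching the abstract's comparison of the lowest and highest terciles against the middle one.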
John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah
Objective: Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate the accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.
Materials and methods: We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.
Results: We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.
Discussion: EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.
{"title":"Measuring the accuracy of electronic health record-based phenotyping in the All of Us Research Program to optimize statistical power for genetic association testing.","authors":"John Baierl, Yi-Wen Hsiao, Michelle R Jones, Pei-Chen Peng, Paul D P Pharoah","doi":"10.1093/jamia/ocaf234","DOIUrl":"https://doi.org/10.1093/jamia/ocaf234","url":null,"abstract":"<p><strong>Objective: </strong>Accurate phenotyping is an essential task for researchers utilizing electronic health record (EHR)-linked biobank programs like the All of Us Research Program to study human genetics. However, little guidance is available on how to select an EHR-based phenotyping procedure that maximizes downstream statistical power. This study aims to estimate accuracy of three phenotype definitions of ovarian, female breast, and colorectal cancers in All of Us (v7 release) and determine which is most likely to optimize downstream statistical power for genetic association testing.</p><p><strong>Materials and methods: </strong>We used empirical carrier frequencies of deleterious variants in known risk genes to estimate the accuracy of each phenotype definition and compute statistical power after accounting for the probability of outcome misclassification.</p><p><strong>Results: </strong>We found that the choice of phenotype definition can have a substantial impact on statistical power for association testing and that no approach was optimal across all tested diseases. The impact on power was particularly acute for rarer diseases and target risk alleles of moderate penetrance or low frequency. 
Additionally, our results suggest that the accuracy of higher-complexity phenotyping algorithms is inconsistent across Black and non-Hispanic White participants in All of Us, highlighting the potential for case ascertainment biases to impact downstream association testing.</p><p><strong>Discussion: </strong>EHR-based phenotyping presents a bottleneck for maximizing power to detect novel risk alleles in All of Us, as well as a potential source of differential outcome misclassification that researchers should be aware of. We discuss the implications of this as well as potential mitigation strategies.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
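The core idea in the Baierl et al. abstract — that outcome misclassification from imperfect phenotyping attenuates power for association testing — can be illustrated with a normal-approximation power calculation for comparing carrier frequencies between cases and controls. All numbers here (carrier frequencies, PPV, sample sizes) are illustrative assumptions, not values from the study.

```python
from scipy.stats import norm

def power_two_prop(p1, p0, n1, n0, alpha=0.05):
    """Normal-approximation power for a two-sample comparison of carrier frequencies."""
    se = (p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0) ** 0.5
    z = abs(p1 - p0) / se
    return norm.cdf(z - norm.ppf(1 - alpha / 2))

p_case, p_ctrl = 0.04, 0.005   # deleterious-variant carrier frequencies (assumed)
n_case, n_ctrl = 500, 5000
ppv = 0.7                      # phenotype definition's positive predictive value (assumed)
# Non-differential outcome misclassification mixes true controls into the
# observed case group, diluting the observed carrier-frequency difference:
p_case_obs = ppv * p_case + (1 - ppv) * p_ctrl

power_full = power_two_prop(p_case, p_ctrl, n_case, n_ctrl)
power_att = power_two_prop(p_case_obs, p_ctrl, n_case, n_ctrl)
print(f"power with perfect phenotyping: {power_full:.3f}")
print(f"power with PPV={ppv}:          {power_att:.3f}")
```

The attenuation grows as the phenotype's PPV drops or as the variant becomes rarer, consistent with the abstract's observation that the impact is most acute for rarer diseases and low-frequency risk alleles.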
Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas
Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.
{"title":"Digital health literacy as mediator between language preference and telehealth use among Latinos in the United States.","authors":"Miguel Linares, Jorge A Rodriguez, Lauren E Wisk, Douglas S Bell, Arleen Brown, Alejandra Casillas","doi":"10.1093/jamia/ocaf232","DOIUrl":"https://doi.org/10.1093/jamia/ocaf232","url":null,"abstract":"<p><p>Using 2023-2024 U.S. National Health Interview Survey data, we found that digital health literacy (dHL) mediated nearly half of the difference in telehealth use between Latino adults with non-English and English language preference. These findings identify dHL as a modifiable mechanism linking linguistic and digital access barriers, underscoring the need for multilingual, inclusive, and equitable telehealth design.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145960702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
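The mediation result summarized in the Linares et al. abstract — the share of the language-preference gap in telehealth use explained by digital health literacy — can be sketched with the difference-of-coefficients method. The data below are simulated, the outcome is treated as continuous (linear-probability style) for simplicity, and the coefficients are assumptions chosen so the proportion mediated lands near one half.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
non_english = rng.integers(0, 2, n)                      # 1 = non-English language preference
dhl = 0.6 - 0.3 * non_english + rng.normal(0, 0.2, n)    # dHL score, lower for non-English
telehealth = 0.2 + 0.5 * dhl - 0.15 * non_english + rng.normal(0, 0.1, n)
df = pd.DataFrame({"non_english": non_english, "dhl": dhl, "telehealth": telehealth})

# Total effect (c): gap in telehealth use by language preference, ignoring dHL.
total = smf.ols("telehealth ~ non_english", df).fit().params["non_english"]
# Direct effect (c'): the gap remaining after adjusting for the mediator.
direct = smf.ols("telehealth ~ non_english + dhl", df).fit().params["non_english"]
prop_mediated = (total - direct) / total
print(f"proportion mediated by dHL: {prop_mediated:.2f}")
```

In practice a binary telehealth-use outcome would call for logistic models and a mediation method that handles noncollapsibility (eg, counterfactual-based mediation analysis); this sketch only shows the arithmetic of the proportion-mediated estimand.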