Jessica J Pourian, Ben Michaels, Anh Vo, A Jay Holmgren, Augusto Garcia-Agundez, Valerie Flaherman
Background and significance: Acute otitis media (AOM) is a leading cause of pediatric antibiotic overuse. Safety Net Antibiotic Prescriptions (SNAPs) are recommended for antibiotic stewardship but are difficult to identify due to lack of structured documentation.
Objective: This study validates the accuracy of Versa, a GPT-4o-based, HIPAA-compliant large language model (LLM), in classifying AOM treatment plans from physician notes.
Methods: A retrospective cross-sectional study analyzed pediatric AOM encounters. Multiple prompting strategies were used to classify treatment plans and validated against a representative sample of manual reviews by 2 pediatricians. A locally fine-tuned model, Clinical-Longformer, was also trained and tested against Versa and human review.
Results: In total, 5707 encounters were included; 374 were reviewed manually. Zero-shot accuracy was 97.8%; few-shot accuracy was 85%. Clinical-Longformer achieved 93.3% accuracy.
Conclusion: Versa effectively identifies AOM treatment plans, providing a cost-efficient quality improvement tracking tool for prescription practice patterns in pediatric antibiotic stewardship efforts.
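The study's actual prompts and label taxonomy are not published in this abstract; as a rough illustration of the zero-shot classification setup described above, here is a minimal sketch. The three treatment categories and the prompt wording are assumptions for demonstration, not the authors' materials.

```python
# Illustrative only: TREATMENT_CLASSES and the prompt text are assumptions,
# not the study's published prompt or label set.
TREATMENT_CLASSES = [
    "immediate antibiotics",
    "safety net antibiotic prescription (SNAP)",
    "no antibiotics",
]

def build_zero_shot_prompt(note_text: str) -> str:
    """Assemble a zero-shot prompt asking an LLM to pick exactly one class."""
    options = "; ".join(TREATMENT_CLASSES)
    return (
        "Classify the treatment plan in the following pediatric acute otitis "
        f"media note as exactly one of: {options}.\n"
        f"Note:\n{note_text}\n"
        "Answer with the class name only."
    )

def accuracy(predicted: list[str], manual_review: list[str]) -> float:
    """Fraction of model labels that agree with manual chart review."""
    matches = sum(p == m for p, m in zip(predicted, manual_review))
    return matches / len(manual_review)
```

Validation against the 374 manually reviewed encounters then reduces to comparing the model's label list with the pediatricians' labels via `accuracy`.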
"A SNAPpy use of large language models: using large language models to classify treatment plans in pediatric acute otitis media." Journal of the American Medical Informatics Association, December 2025, pp. 1947-1951. doi:10.1093/jamia/ocaf170. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646383/pdf/
Shelly Soffer, Mahmud Omar, Moran Gendler, Benjamin S Glicksberg, Patricia Kovatch, Orly Efros, Robert Freeman, Alexander W Charney, Girish N Nadkarni, Eyal Klang
Objectives: Text embeddings are promising for semantic tasks, such as retrieval augmented generation (RAG). However, their application in health care is underexplored due to a lack of benchmarking methods. We introduce a scalable benchmarking method to test embeddings for health-care semantic tasks.
Materials and methods: We evaluated 39 embedding models across 7 medical semantic similarity tasks using diverse datasets. These datasets comprised real-world patient data (from the Mount Sinai Health System and MIMIC IV), biomedical texts from PubMed, and synthetic data generated with Llama-3-70b. We first assessed semantic textual similarity (STS) by correlating the model-generated similarity scores with noise levels using Spearman rank correlation. We then reframed the same tasks as retrieval problems, evaluated by mean reciprocal rank and recall at k.
Results: In total, evaluating 2000 text pairs per 7 tasks for STS and retrieval yielded 3.28 million model assessments. Larger models (>7b parameters), such as those based on Mistral-7b and Gemma-2-9b, consistently performed well, especially in long-context tasks. The NV-Embed-v1 model (7b parameters), although top in short tasks, underperformed in long tasks. For short tasks, smaller models such as b1ade-embed (335M parameters) performed on par with the larger models. For long retrieval tasks, the larger models significantly outperformed the smaller ones.
Discussion: The proposed benchmarking framework demonstrates scalability and flexibility, offering a structured approach to guide the selection of embedding models for a wide range of health-care tasks.
Conclusion: By matching the appropriate model with the task, the framework enables more effective deployment of embedding models, enhancing critical applications such as semantic search and retrieval-augmented generation (RAG).
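The retrieval metrics named above have compact definitions. A minimal sketch of mean reciprocal rank and recall at k, given the 1-based rank of the correct document for each query (illustrative, not the study's evaluation code):

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """MRR: average of 1/rank, where rank is the 1-based position of the
    correct document in each query's retrieved list."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose correct document appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)
```

In an embedding benchmark, the rank for each query is obtained by scoring all candidate texts with cosine similarity against the query embedding and locating the known-correct candidate in the sorted list.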
"A scalable framework for benchmark embedding models in semantic health-care tasks." Journal of the American Medical Informatics Association, December 2025, pp. 1877-1887. doi:10.1093/jamia/ocaf149. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646376/pdf/
Samrachana Adhikari, Tyrel Stokes, Xiyue Li, Yunan Zhao, Cassidy Fitchett, Nathalia Ladino, Steven Lawrence, Min Qian, Young S Cho, Carine Hamo, John A Dodson, Rumi Chunara, Ian M Kronish, Amrita Mukhopadhyay, Saul B Blecker
Objective: While timely interventions can improve medication adherence, it is challenging to identify which patients are at risk of nonadherence at point-of-care. We aim to develop and validate flexible machine learning (ML) models to predict a continuous measure of adherence to guideline-directed medication therapies (GDMTs) for heart failure (HF).
Materials and methods: We utilized a large electronic health record (EHR) cohort of 34,697 HF patients seen at NYU Langone Health with an active prescription for ≥1 GDMT between April 01, 2021 and October 31, 2022. The outcome was adherence to GDMT, measured as proportion of days covered (PDC) at 6 months following a clinical encounter. Over 120 predictors included patient-, therapy-, healthcare-, and neighborhood-level factors guided by the World Health Organization's model of barriers to adherence. We compared the performance of several ML models and their ensemble (superlearner) for predicting PDC against a traditional ordinary least squares (OLS) regression model, using mean absolute error (MAE) averaged across 10-fold cross-validation, percent increase in MAE relative to superlearner, and predictive difference across deciles of predicted PDC.
Results: Superlearner, a flexible nonparametric prediction approach, demonstrated superior prediction performance. Superlearner and quantile random forest had the lowest MAE (mean [95% CI] = 18.9% [18.7%-19.1%] for both), followed by quantile neural network (19.5% [19.3%-19.7%]) and kernel support vector regression (19.8% [19.6%-20.0%]). Gradient boosted trees and OLS were the 2 worst-performing models, with 17% and 14% higher MAEs, respectively, relative to superlearner. Superlearner demonstrated improved predictive difference.
Discussion: Fairness evaluation and external validation are needed prior to clinical integration.
Conclusion: This development-phase study suggests the potential of linked EHR-pharmacy data and ML to identify HF patients who will benefit from medication adherence interventions.
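Proportion of days covered, the outcome predicted above, has a standard definition: the share of days in an observation window on which the patient had medication on hand. A minimal sketch of how PDC could be computed from pharmacy-fill records (the fill-record shape here is an assumption for illustration, not the study's data model):

```python
from datetime import date, timedelta

def proportion_of_days_covered(
    fills: list[tuple[date, int]], start: date, end: date
) -> float:
    """PDC: fraction of days in [start, end) covered by at least one fill.

    fills: (fill_date, days_supply) pairs, e.g. from linked pharmacy data.
    Overlapping fills do not double-count days.
    """
    covered: set[date] = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day < end:
                covered.add(day)
    return len(covered) / (end - start).days
```

Because covered days are collected in a set, early refills that overlap an existing supply do not inflate the measure.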
"Machine learning based prediction of medication adherence in heart failure using large electronic health record cohort with linkages to pharmacy-fill and neighborhood-level data." Journal of the American Medical Informatics Association, December 2025, pp. 1822-1832. doi:10.1093/jamia/ocaf162. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646373/pdf/
Fateme Nateghi Haredasht, Ivan Lopez, Steven Tate, Pooya Ashtari, Min Min Chan, Deepali Kulkarni, Chwen-Yuen Angie Chen, Maithri Vangala, Kira Griffith, Bryan Bunning, Adam S Miner, Tina Hernandez-Boussard, Keith Humphreys, Anna Lembke, L Alexander Vance, Jonathan H Chen
Objective: Building upon our previous work on predicting treatment retention in medications for opioid use disorder, we aimed to improve 6-month retention prediction in buprenorphine-naloxone (BUP-NAL) therapy by incorporating features derived from large language models (LLMs) applied to unstructured clinical notes.
Materials and methods: We used de-identified electronic health record (EHR) data from Stanford Health Care (STARR) for model development and internal validation, and the NeuroBlu behavioral health database for external validation. Structured features were supplemented with 13 clinical and psychosocial features extracted from free-text notes using the CLinical Entity Augmented Retrieval pipeline, which combines named entity recognition with LLM-based classification to provide contextual interpretation. We trained classification (Logistic Regression, Random Forest, XGBoost) and survival models (CoxPH, Random Survival Forest, Survival XGBoost), evaluated using Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and C-index.
Results: XGBoost achieved the highest classification performance (ROC-AUC = 0.65). Incorporating LLM-derived features improved model performance across all architectures, with the largest gains observed in simpler models such as Logistic Regression. In time-to-event analysis, Random Survival Forest and Survival XGBoost reached the highest C-index (≈0.65). SHapley Additive exPlanations analysis identified LLM-extracted features like Chronic Pain, Liver Disease, and Major Depression as key predictors. We also developed an interactive web tool for real-time clinical use.
Discussion: Features extracted using NLP and LLM-assisted methods improved model accuracy and interpretability, revealing valuable psychosocial risks not captured in structured EHRs.
Conclusion: Combining structured EHR data with LLM-extracted features moderately improves BUP-NAL retention prediction, enabling personalized risk stratification and advancing AI-driven care for substance use disorders.
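The C-index used to evaluate the survival models above can be computed directly from predicted risk scores and observed follow-up. A minimal sketch of Harrell's concordance index for right-censored data (a quadratic-time illustration, not the study's implementation):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when i has an observed event (events[i]
    truthy) and times[i] < times[j]. It is concordant when the subject
    with the earlier event also has the higher risk score; score ties
    count as 0.5.
    """
    concordant, tied, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable
```

A C-index of ≈0.65, as reported for Random Survival Forest and Survival XGBoost, means roughly 65% of comparable patient pairs are ranked correctly by the model.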
"Predicting treatment retention in medication for opioid use disorder: a machine learning approach using NLP and LLM-derived clinical features." Journal of the American Medical Informatics Association, December 2025, pp. 1865-1876. doi:10.1093/jamia/ocaf157. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646374/pdf/
Sinan L Aktay, Ozan A Aktay, Samia Menon, Shuo Jim Huang, Rozalina G McCoy
Objectives: Gaps in transportation, particularly public transit, are a significant barrier to accessible, high-quality healthcare. Health systems, payors, and regulatory bodies recognize the need to identify and address these gaps. However, clinical research examining public transportation accessibility and its impacts on healthcare utilization, outcomes, and costs remains limited. Existing tools used for studying public transit are generally non-HIPAA compliant, expensive, proprietary, and/or difficult to use. A tool addressing these concerns is needed to enable the incorporation of transportation variables into research and clinical care settings.
Materials and methods: We developed and implemented a novel framework for building a public transit routing system from free, publicly available data and offline software to maintain HIPAA compliance. The system consists of a transit router and a geocoder for converting addresses into coordinates.
Results: A total of 463 879 of 505 379 (∼91.8%) Baltimore, Maryland, addresses were successfully routed to the University of Maryland Medical Center within 24 hours of compute time. A significant portion of journey time was spent walking (36% of median trip time) or on a transit vehicle (57.2%). Testing the router with varying amounts of random-access memory showed a plateau in routing speed between 12 and 20 GB. The geocoding approach is >90% consistent with a widely used but non-HIPAA-compliant geocoder.
Discussion: The methodology and step-by-step guidance shared in this study can allow researchers, public health professionals, not-for-profit agencies, and other stakeholders to efficiently, effectively, and safely incorporate public transportation information into their work.
Conclusion: Public transportation routing using freely available data and software is possible in a HIPAA-compliant manner.
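The abstract reports >90% agreement between the offline geocoder and a widely used online one, but does not state the agreement criterion. A minimal sketch of one plausible check, comparing the two coordinate outputs per address by great-circle distance (the 100 m tolerance is an assumption, not the study's threshold):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def consistency_rate(pairs, tolerance_m=100.0):
    """Share of addresses whose two geocoders agree within tolerance_m.

    pairs: [((lat_a, lon_a), (lat_b, lon_b)), ...], one entry per address.
    """
    agree = sum(haversine_m(*a, *b) <= tolerance_m for a, b in pairs)
    return agree / len(pairs)
```

Running `consistency_rate` over the shared address set would reproduce a ">90% consistent" style figure under whatever tolerance is chosen.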
"Supporting public transit research in healthcare settings: testing a free, fast, and secure method for routing public transit from patient address to the point of care." Journal of the American Medical Informatics Association, December 2025, pp. 1802-1810. doi:10.1093/jamia/ocaf161. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646371/pdf/
Lisa Pilgram, Samer El Kababji, Dan Liu, Khaled El Emam
Objective: In medical research and education, generative artificial intelligence/machine learning (AI/ML) models to synthesize artificial medical data can enable the sharing of high-quality data while preserving the privacy of patients. Given that such data is often high-dimensional, a relevant consideration is whether to synthesize the entire dataset when only a task-relevant subset is needed. This study evaluates how the number of variables in training impacts fidelity, utility, and privacy of the synthetic data (SD).
Material and methods: We used 12 cross-sectional medical datasets, defined a downstream task with corresponding core variables, and derived 6354 variants by adding adjunct variables to the core. SD was generated using 7 different generative models and evaluated for fidelity, downstream utility, and privacy. Mixed-effect models were used to assess the effect of adjunct variables on the respective evaluation metric, accounting for the medical dataset as a random component.
Results: Fidelity was unaffected by the number of adjunct variables in 5/7 synthetic data generation (SDG) models. Similarly, downstream utility remained stable in 6/7 (predictive task) and 5/7 (inferential task) SDG models. Where significant effects were observed, they were minimal, resulting, for example, in a 0.05 decrease in area under the receiver operating characteristic curve (AUROC) when adding 120 variables. Privacy was not impacted by the number of adjunct variables.
Discussion: Our findings show that fidelity, utility, and privacy are preserved when generating a more comprehensive medical dataset than the task-relevant subset.
Conclusion: Our findings support a cost-effective, utility-preserving, and privacy-preserving way of implementing SDG in medical research and education.
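Utility differences like the 0.05 AUROC decrease noted above can be measured with a direct rank-based AUROC. A minimal sketch via the Mann-Whitney formulation (illustrative only, not the study's evaluation code):

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen negative,
    with score ties counting 0.5. Assumes both classes are present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Comparing `auroc` for a downstream model trained on core-only synthetic data versus core-plus-adjunct synthetic data gives the kind of utility delta the mixed-effect analysis quantifies.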
"Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data." Journal of the American Medical Informatics Association, December 2025, pp. 1843-1854. doi:10.1093/jamia/ocaf169. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646385/pdf/
Dimitrios Bounias, Lina Simons, Michael Baumgartner, Chris Ehring, Peter Neher, Lorenz A Kapsner, Balint Kovacs, Ralf Floca, Paul F Jaeger, Jessica Eberle, Dominique Hadler, Frederik B Laun, Sabine Ohlmeyer, Lena Maier-Hein, Michael Uder, Evelyn Wenkel, Klaus H Maier-Hein, Sebastian Bickelhaupt
Objectives: Breast diffusion-weighted imaging (DWI) has shown potential as a standalone imaging technique for certain indications, eg, supplemental screening of women with dense breasts. This study evaluates an artificial intelligence (AI)-powered computer-aided diagnosis (CAD) system for clinical interpretation and workload reduction in breast DWI.
Materials and methods: This retrospective, IRB-approved study included n = 824 examinations for model development (2017-2020) and n = 235 for evaluation (01/2021-06/2021). Readings were performed by three readers, either with the AI-CAD or by manual reading alone. BI-RADS-like (Breast Imaging Reporting and Data System) classification was based on DWI. Histopathology served as ground truth. The model was nnDetection-based, trained using 5-fold cross-validation and ensembling. Statistical significance was determined using McNemar's test. Inter-rater agreement was calculated using Cohen's kappa. Model performance was calculated using the area under the receiver operating characteristic curve (AUC).
Results: The AI-augmented approach significantly reduced BI-RADS-like 3 calls in breast DWI by 29% (P =.019) and increased inter-rater agreement (0.57 ± 0.10 vs 0.49 ± 0.11), while preserving diagnostic accuracy. Two of the three readers detected more malignant lesions (63/69 vs 59/69 and 64/69 vs 62/69) with the AI-CAD. The AI model achieved an AUC of 0.78 (95% CI: [0.72, 0.85]; P <.001), which increased for women at screening age to 0.82 (95% CI: [0.73, 0.90]; P <.001), indicating a potential for workload reduction of 20.9% at 96% sensitivity.
Discussion and conclusion: Breast DWI might benefit from AI support. In our study, AI showed potential for reduction of BI-RADS-like 3 calls and increase of inter-rater agreement. However, given the limited study size, further research is needed.
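The abstract above reports inter-rater agreement as Cohen's kappa (0.57 vs 0.49). As background, here is a minimal sketch of how this chance-corrected agreement statistic is computed; the reader ratings below are invented for illustration and are not data from the study:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for the
    agreement expected by chance from each rater's label frequencies."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: probability of agreeing if each rater assigned
    # labels independently according to their own marginal frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical BI-RADS-like calls from two readers on eight lesions.
reader_1 = [2, 3, 4, 2, 3, 3, 5, 2]
reader_2 = [2, 3, 3, 2, 3, 4, 5, 2]
print(round(cohen_kappa(reader_1, reader_2), 2))  # → 0.64
```

Kappa of 1 means perfect agreement; 0 means no better than chance, which is why it is preferred over raw percent agreement when label frequencies are skewed.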
{"title":"Including AI in diffusion-weighted breast MRI has potential to increase reader confidence and reduce workload.","authors":"Dimitrios Bounias, Lina Simons, Michael Baumgartner, Chris Ehring, Peter Neher, Lorenz A Kapsner, Balint Kovacs, Ralf Floca, Paul F Jaeger, Jessica Eberle, Dominique Hadler, Frederik B Laun, Sabine Ohlmeyer, Lena Maier-Hein, Michael Uder, Evelyn Wenkel, Klaus H Maier-Hein, Sebastian Bickelhaupt","doi":"10.1093/jamia/ocaf156","DOIUrl":"10.1093/jamia/ocaf156","url":null,"abstract":"<p><strong>Objectives: </strong>Breast diffusion-weighted imaging (DWI) has shown potential as a standalone imaging technique for certain indications, eg, supplemental screening of women with dense breasts. This study evaluates an artificial intelligence (AI)-powered computer-aided diagnosis (CAD) system for clinical interpretation and workload reduction in breast DWI.</p><p><strong>Materials and methods: </strong>This retrospective IRB-approved study included: n = 824 examinations for model development (2017-2020) and n = 235 for evaluation (01/2021-06/2021). Readings were performed by three readers using either the AI-CAD or manual readings. BI-RADS-like (Breast Imaging Reporting and Data System) classification was based on DWI. Histopathology served as ground truth. The model was nnDetection-based, trained using 5-fold cross-validation and ensembling. Statistical significance was determined using McNemar's test. Inter-rater agreement was calculated using Cohen's kappa. Model performance was calculated using the area under the receiver operating curve (AUC).</p><p><strong>Results: </strong>The AI-augmented approach significantly reduced BI-RADS-like 3 calls in breast DWI by 29% (P =.019) and increased interrater agreement (0.57 ± 0.10 vs 0.49 ± 0.11), while preserving diagnostic accuracy. Two of the three readers detected more malignant lesions (63/69 vs 59/69 and 64/69 vs 62/69) with the AI-CAD. 
The AI model achieved an AUC of 0.78 (95% CI: [0.72, 0.85]; P <.001), which increased for women at screening age to 0.82 (95% CI: [0.73, 0.90]; P <.001), indicating a potential for workload reduction of 20.9% at 96% sensitivity.</p><p><strong>Discussion and conclusion: </strong>Breast DWI might benefit from AI support. In our study, AI showed potential for reduction of BI-RADS-like 3 calls and increase of inter-rater agreement. However, given the limited study size, further research is needed.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"1908-1915"},"PeriodicalIF":4.6,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646386/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145126441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Merlin Engelke, Giulia Baldini, Jens Kleesiek, Felix Nensa, Amin Dada
Objective: To address the challenges of data heterogeneity and manual feature engineering in clinical predictive modeling, we introduce FHIR-Former, an open-source framework integrating Fast Healthcare Interoperability Resources (FHIR) with large language models (LLMs) to automate and standardize clinical prediction tasks.
Materials and methods: FHIR-Former dynamically processes structured (eg, lab results, medications) and unstructured (eg, clinical notes) data from FHIR resources. The pipeline supports multiple classification tasks, including 30-day readmission, imaging study prediction, and ICD code classification. Leveraging open-source LLMs (GeBERTa), we trained models on 1.1 million data points across ten FHIR resources using retrospective inpatient data (2018-2024). Hyperparameters were optimized via Bayesian methods, and outputs were mapped to FHIR RiskAssessment resources for interoperability.
Results: FHIR-Former achieved an F1-score of 70.7% and accuracy of 72.9% for 30-day readmission, 51.8% F1-score (88.1% accuracy) for mortality prediction, and 61% macro F1-score for imaging study classification. The ICD code prediction model attained 94% accuracy. The framework demonstrated promising performance for readmission and scaled across tasks without manual feature engineering.
Discussion: FHIR-Former eliminates institution-specific preprocessing by adapting to diverse FHIR implementations, enabling seamless integration of multimodal data. Its configurable architecture outperformed prior frameworks reliant on static inputs or limited to unstructured text. Real-time risk scores embedded in FHIR servers enhance clinical workflows without disrupting existing practices.
Conclusion: By harmonizing FHIR standardization with LLM flexibility, FHIR-Former advances scalable, interoperable predictive modeling in healthcare. The open-source framework facilitates automation, improves resource allocation, and supports personalized decision-making, bridging gaps between AI innovation and clinical practice.
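The pipeline maps model outputs to FHIR RiskAssessment resources for interoperability. As an illustrative sketch only (the field names follow the FHIR R4 RiskAssessment resource; the function name, patient identifier, and probability are hypothetical, not taken from FHIR-Former's code), such a mapping could look like:

```python
import json

def to_risk_assessment(patient_id: str, outcome_text: str, probability: float) -> dict:
    """Wrap a model's predicted probability in a minimal FHIR R4
    RiskAssessment resource dictionary, ready to serialize and store."""
    return {
        "resourceType": "RiskAssessment",
        "status": "final",
        "subject": {"reference": f"Patient/{patient_id}"},
        "prediction": [
            {
                # The predicted outcome and its probability in [0, 1].
                "outcome": {"text": outcome_text},
                "probabilityDecimal": round(probability, 4),
            }
        ],
    }

resource = to_risk_assessment("example-001", "30-day readmission", 0.7312)
print(json.dumps(resource, indent=2))
```

Emitting predictions as standard resources like this is what lets a FHIR server store and query them alongside the clinical data they were derived from.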
{"title":"FHIR-Former: enhancing clinical predictions through Fast Healthcare Interoperability Resources and large language models.","authors":"Merlin Engelke, Giulia Baldini, Jens Kleesiek, Felix Nensa, Amin Dada","doi":"10.1093/jamia/ocaf165","DOIUrl":"10.1093/jamia/ocaf165","url":null,"abstract":"<p><strong>Objective: </strong>To address the challenges of data heterogeneity and manual feature engineering in clinical predictive modeling, we introduce FHIR-Former, an open-source framework integrating Fast Healthcare Interoperability Resources (FHIR) with large language models (LLMs) to automate and standardize clinical prediction tasks.</p><p><strong>Materials and methods: </strong>FHIR-Former dynamically processes structured (eg, lab results, medications) and unstructured (eg, clinical notes) data from FHIR resources. The pipeline supports multiple classification tasks, including 30-day readmission, imaging study prediction, and ICD code classification. Leveraging open-source LLMs (GeBERTa), we trained models on 1.1 million data points across ten FHIR resources using retrospective inpatient data (2018-2024). Hyperparameters were optimized via Bayesian methods, and outputs were mapped to FHIR RiskAssessment resources for interoperability.</p><p><strong>Results: </strong>FHIR-Former achieved an F1-score of 70.7% and accuracy of 72.9% for 30-day readmission, 51.8% F1-score (88.1% accuracy) for mortality prediction, and 61% macro F1-score for imaging study classification. The ICD code prediction model attained 94% accuracy. Performance demonstrated promising performance for readmission and showed scalability across tasks without manual feature engineering.</p><p><strong>Discussion: </strong>FHIR-Former eliminates institution-specific preprocessing by adapting to diverse FHIR implementations, enabling seamless integration of multimodal data. Its configurable architecture outperformed prior frameworks reliant on static inputs or limited to unstructured text. 
Real-time risk scores embedded in FHIR servers enhance clinical workflows without disrupting existing practices.</p><p><strong>Conclusion: </strong>By harmonizing FHIR standardization with LLM flexibility, FHIR-Former advances scalable, interoperable predictive modeling in healthcare. The open-source framework facilitates automation, improves resource allocation, and supports personalized decision-making, bridging gaps between AI innovation and clinical practice.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"1793-1801"},"PeriodicalIF":4.6,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646377/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145287573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vijeeth Guggilla, Mengjia Kang, Melissa J Bak, Steven D Tran, Anna Pawlowski, Prasanth Nannapaneni, Luke V Rasmussen, Daniel Schneider, Helen K Donnelly, Ankit Agrawal, David Liebovitz, Alexander V Misharin, G R Scott Budinger, Richard G Wunderink, Theresa L Walunas, Catherine A Gao
Objective: Rule-based structured data algorithms and natural language processing (NLP) approaches applied to unstructured clinical notes have limited accuracy and poor generalizability for identifying immunosuppression. Large language models (LLMs) may effectively identify patients with heterogeneous types of immunosuppression from unstructured clinical notes. We compared the performance of LLMs applied to unstructured notes for identifying patients with immunosuppressive conditions or immunosuppressive medication use against 2 baselines: (1) structured data algorithms using diagnosis codes and medication orders and (2) NLP approaches applied to unstructured notes.
Materials and methods: We used hospital admission notes from a primary cohort of 827 intensive care unit (ICU) patients at Northwestern Memorial Hospital and a validation cohort of 200 ICU patients at Beth Israel Deaconess Medical Center, along with diagnosis codes and medication orders from the primary cohort. We evaluated the performance of structured data algorithms, NLP approaches, and LLMs in identifying 7 immunosuppressive conditions and 6 immunosuppressive medications.
Results: In the primary cohort, structured data algorithms achieved peak F1 scores ranging from 0.30 to 0.97 for identifying immunosuppressive conditions and medications. NLP approaches achieved peak F1 scores ranging from 0 to 1. GPT-4o outperformed or matched structured data algorithms and NLP approaches across all conditions and medications, with F1 scores ranging from 0.51 to 1. GPT-4o also performed impressively in our validation cohort (F1 = 1 for 8/13 variables).
Discussion: LLMs, particularly GPT-4o, outperformed structured data algorithms and NLP approaches in identifying immunosuppressive conditions and medications with robust external validation.
Conclusion: LLMs can be applied for improved cohort identification for research purposes.
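The F1 scores quoted above balance precision against recall per variable. A minimal sketch of how such a per-variable F1 is computed from binary chart-review annotations; the labels below are invented for illustration, not study data:

```python
def f1_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall over binary labels,
    where 1 marks a patient flagged for the immunosuppression variable."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical manual chart review vs model output for one variable.
manual_review = [1, 0, 1, 1, 0, 0, 1, 0]
model_output  = [1, 0, 1, 0, 0, 1, 1, 0]
print(round(f1_score(manual_review, model_output), 2))  # → 0.75
```

F1 is preferred over raw accuracy here because immunosuppression variables are often rare: a model that never flags anyone can still score high accuracy while its F1 collapses to zero.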
{"title":"Large language models accurately identify immunosuppression in intensive care unit patients.","authors":"Vijeeth Guggilla, Mengjia Kang, Melissa J Bak, Steven D Tran, Anna Pawlowski, Prasanth Nannapaneni, Luke V Rasmussen, Daniel Schneider, Helen K Donnelly, Ankit Agrawal, David Liebovitz, Alexander V Misharin, G R Scott Budinger, Richard G Wunderink, Theresa L Walunas, Catherine A Gao","doi":"10.1093/jamia/ocaf141","DOIUrl":"10.1093/jamia/ocaf141","url":null,"abstract":"<p><strong>Objective: </strong>Rule-based structured data algorithms and natural language processing (NLP) approaches applied to unstructured clinical notes have limited accuracy and poor generalizability for identifying immunosuppression. Large language models (LLMs) may effectively identify patients with heterogenous types of immunosuppression from unstructured clinical notes. We compared the performance of LLMs applied to unstructured notes for identifying patients with immunosuppressive conditions or immunosuppressive medication use against 2 baselines: (1) structured data algorithms using diagnosis codes and medication orders and (2) NLP approaches applied to unstructured notes.</p><p><strong>Materials and methods: </strong>We used hospital admission notes from a primary cohort of 827 intensive care unit (ICU) patients at Northwestern Memorial Hospital and a validation cohort of 200 ICU patients at Beth Israel Deaconess Medical Center, along with diagnosis codes and medication orders from the primary cohort. We evaluated the performance of structured data algorithms, NLP approaches, and LLMs in identifying 7 immunosuppressive conditions and 6 immunosuppressive medications.</p><p><strong>Results: </strong>In the primary cohort, structured data algorithms achieved peak F1 scores ranging from 0.30 to 0.97 for identifying immunosuppressive conditions and medications. NLP approaches achieved peak F1 scores ranging from 0 to 1. 
GPT-4o outperformed or matched structured data algorithms and NLP approaches across all conditions and medications, with F1 scores ranging from 0.51 to 1. GPT-4o also performed impressively in our validation cohort (F1 = 1 for 8/13 variables).</p><p><strong>Discussion: </strong>LLMs, particularly GPT-4o, outperformed structured data algorithms and NLP approaches in identifying immunosuppressive conditions and medications with robust external validation.</p><p><strong>Conclusion: </strong>LLMs can be applied for improved cohort identification for research purposes.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"1888-1898"},"PeriodicalIF":4.6,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490808/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145114981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Objective: To evaluate the accuracy, computational cost, and portability of a new natural language processing (NLP) method for extracting medication information from clinical narratives.
Materials and methods: We propose an original transformer-based architecture for the extraction of entities and their relations pertaining to patients' medication regimen. First, we used this approach to train and evaluate a model on French clinical notes, using a newly annotated corpus from Hôpitaux Universitaires de Strasbourg. Second, the portability of the approach was assessed by conducting an evaluation on clinical documents in English from the 2018 n2c2 shared task. Information extraction accuracy and computational cost were assessed by comparison with an available method using transformers.
Results: On relation extraction alone, the proposed architecture achieves performance competitive with the state of the art on both French and English (F-measures 0.82 and 0.96 vs 0.81 and 0.95) while reducing the computational cost by a factor of 10. End-to-end (named entity recognition plus relation extraction) F1 is 0.69 on the French corpus and 0.82 on the English corpus.
Discussion: While an existing system developed for English notes was deployed in a French hospital setting with reasonable effort, we found that the alternative architecture offered end-to-end drug information extraction with comparable extraction performance and lower computational impact for both French and English clinical text processing.
Conclusion: The proposed architecture can be used to extract medication information from clinical text with high performance and low computational cost, and is therefore well suited to the typically limited IT resources of hospitals.
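End-to-end scoring in this setting means a predicted relation counts only when both the recognized entity spans and the relation label match the gold standard, which is why end-to-end F1 sits below relation-extraction-only F1. A minimal sketch of such an evaluation over (drug, attribute-type, value) triples, with invented examples rather than data from the annotated corpus:

```python
def end_to_end_f1(gold, predicted):
    """Micro F1 over relation triples: a prediction is a true positive
    only if the head entity, relation type, and tail entity all match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold vs system triples for one clinical note.
gold = {
    ("amoxicilline", "dosage", "1 g"),
    ("amoxicilline", "frequency", "3x/jour"),
    ("ibuprofene", "dosage", "200 mg"),
}
predicted = {
    ("amoxicilline", "dosage", "1 g"),
    ("amoxicilline", "frequency", "3x/jour"),
}
print(round(end_to_end_f1(gold, predicted), 2))  # → 0.8
```

Because any upstream entity-recognition error invalidates the downstream relation, errors compound: this is the usual explanation for the gap between the 0.82/0.96 relation-only scores and the 0.69/0.82 end-to-end scores reported above.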
{"title":"Efficient extraction of medication information from clinical notes: an evaluation in 2 languages.","authors":"Thibaut Fabacher, Erik-André Sauleau, Emmanuelle Arcay, Bineta Faye, Maxime Alter, Archia Chahard, Nathan Miraillet, Adrien Coulet, Aurélie Névéol","doi":"10.1093/jamia/ocaf113","DOIUrl":"10.1093/jamia/ocaf113","url":null,"abstract":"<p><strong>Objective: </strong>To evaluate the accuracy, computational cost, and portability of a new natural language processing (NLP) method for extracting medication information from clinical narratives.</p><p><strong>Materials and methods: </strong>We propose an original transformer-based architecture for the extraction of entities and their relations pertaining to patients' medication regimen. First, we used this approach to train and evaluate a model on French clinical notes, using a newly annotated corpus from Hôpitaux Universitaires de Strasbourg. Second, the portability of the approach was assessed by conducting an evaluation on clinical documents in English from the 2018 n2c2 shared task. Information extraction accuracy and computational cost were assessed by comparison with an available method using transformers.</p><p><strong>Results: </strong>The proposed architecture achieves on the task of relation extraction itself performance that are competitive with the state-of-the-art on both French and English (F-measures 0.82 and 0.96 vs 0.81 and 0.95), but reduces the computational cost by 10. 
End-to-end (Named Entity recognition and Relation Extraction) F1 performance is 0.69 and 0.82 for French and English corpus.</p><p><strong>Discussion: </strong>While an existing system developed for English notes was deployed in a French hospital setting with reasonable effort, we found that an alternative architecture offered end-to-end drug information extraction with comparable extraction performance and lower computational impact for both French and English clinical text processing, respectively.</p><p><strong>Conclusion: </strong>The proposed architecture can be used to extract medication information from clinical text with high performance and low computational cost and consequently suits with usually limited hospital IT resources.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"1855-1864"},"PeriodicalIF":4.6,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12646380/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145276320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}