Utilizing large language models for detecting hospital-acquired conditions: an empirical study on pulmonary embolism.

IF 4.6 | Zone 2 (Medicine) | Q1 Computer Science, Information Systems | Journal of the American Medical Informatics Association | Pub Date: 2025-05-01 | DOI: 10.1093/jamia/ocaf048

Cheligeer Cheligeer, Danielle A Southern, Jun Yan, Guosong Wu, Jie Pan, Seungwon Lee, Elliot A Martin, Hamed Jafarpour, Cathy A Eastwood, Yong Zeng, Hude Quan

Pages: 876-884. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012340/pdf/

Citations: 0

Abstract

Objectives: Adverse event detection from Electronic Medical Records (EMRs) is challenging due to the low incidence of the event, variability in clinical documentation, and the complexity of data formats. Pulmonary embolism as an adverse event (PEAE) is particularly difficult to identify using existing approaches. This study aims to develop and evaluate a Large Language Model (LLM)-based framework for detecting PEAE from unstructured narrative data in EMRs.

Materials and methods: We conducted a chart review of adult patients (aged 18-100) admitted to tertiary-care hospitals in Calgary, Alberta, Canada, between 2017 and 2022. We developed an LLM-based detection framework consisting of three modules: evidence extraction (implementing both keyword-based and semantic similarity-based filtering methods), discharge information extraction (focusing on six key clinical sections), and PEAE detection. Four open-source LLMs (Llama3, Mistral-7B, Gemma, and Phi-3) were evaluated using positive predictive value, sensitivity, specificity, and F1-score. Model performance for population-level surveillance was assessed at yearly, quarterly, and monthly granularities.
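The keyword-based filtering step of the evidence extraction module can be illustrated with a minimal sketch. The abstract does not publish the study's keyword list, chunk size, or chunking rule, so `PE_KEYWORDS`, `chunk_text`, `keyword_filter`, and the 50-word chunk length below are all hypothetical choices made for illustration only.

```python
import re

# Hypothetical keyword list: the study's actual terms are not given in the abstract.
PE_KEYWORDS = ["pulmonary embolism", "pulmonary embolus", "PE"]


def chunk_text(text, max_words=50):
    """Split a note into consecutive chunks of at most max_words words.

    The fixed word-count rule is an assumption, not the authors' method.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def keyword_filter(chunks, keywords=PE_KEYWORDS):
    """Keep only chunks that mention a keyword as a whole word or phrase."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [chunk for chunk in chunks if pattern.search(chunk)]


if __name__ == "__main__":
    notes = [
        "Patient developed pulmonary embolism on day 3.",
        "No acute distress. Diet as tolerated.",
        "CT ruled out PE.",
    ]
    for chunk in keyword_filter(notes):
        print(chunk)
```

Only the chunks that survive this filter would be passed on to the LLM, which is how a filter of this kind shrinks the amount of text examined per patient.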

Results: The chart review included 10 066 patients, with 40 cases of PEAE identified (0.4% prevalence). All four LLMs demonstrated high sensitivity (87.5-100%) and specificity (94.9-98.9%) across different experimental conditions. Gemma achieved the highest F1-score (28.11%) using keyword-based retrieval with discharge summary inclusion, along with 98.4% specificity, 87.5% sensitivity, and 99.95% negative predictive value. Keyword-based filtering reduced the median chunks per patient from 789 to 310, while semantic filtering further reduced this to 9 chunks. Including discharge summaries improved performance metrics across most models. For population-level surveillance, all models showed strong correlation with actual PEAE trends at yearly granularity (r=0.92-0.99), with Llama3 achieving the highest correlation (0.988).
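A back-of-the-envelope check shows how a 0.4% prevalence pulls PPV down even when sensitivity and specificity are high. The sketch below rebuilds an approximate confusion matrix from the rounded figures reported for Gemma (87.5% sensitivity, 98.4% specificity, 40 PEAE cases among 10 066 patients). The function name, and the assumption that all 10 066 patients form the evaluation denominator, are ours and not taken from the paper.

```python
def metrics_from_rates(n_total, n_positive, sensitivity, specificity):
    """Derive PPV, NPV and F1 from prevalence, sensitivity and specificity.

    Counts are treated as real numbers because they are reconstructed from
    rounded percentages rather than read from an actual confusion matrix.
    """
    n_negative = n_total - n_positive
    tp = sensitivity * n_positive   # true positives
    fn = n_positive - tp            # false negatives
    tn = specificity * n_negative   # true negatives
    fp = n_negative - tn            # false positives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"ppv": ppv, "npv": npv, "f1": f1}


if __name__ == "__main__":
    result = metrics_from_rates(10066, 40, sensitivity=0.875, specificity=0.984)
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions the PPV comes out near 18% and the NPV near 99.95%, in line with the reported NPV. The reconstructed F1 (about 30%) is close to but not identical to the reported 28.11%; the gap may stem from rounding or from a different evaluation denominator, which the abstract does not specify.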

Discussion: The results of our method for PEAE detection using EMR notes demonstrate high sensitivity and specificity across all four tested LLMs, indicating strong performance in distinguishing PEAE from non-PEAE cases. However, the low incidence rate of PEAE contributed to a lower PPV. The keyword-based chunking approach consistently outperformed semantic similarity-based methods, achieving higher F1 scores and PPV, underscoring the importance of domain knowledge in text segmentation. Including discharge summaries further enhanced performance metrics. Our population-based analysis revealed better performance at yearly than at monthly granularity, suggesting the framework's utility for long-term surveillance despite dataset imbalance. Error analysis identified contextual misinterpretation, terminology confusion, and preprocessing limitations as key challenges for future improvement.
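The population-level comparison can be sketched as counting events per period and correlating predicted with actual counts. The abstract reports correlations (r) but does not say how periods were binned or which coefficient was used, so the Pearson formula, the function names, and the handling of empty periods below are assumptions. Any dates fed to it in an example are synthetic, not study data.

```python
from collections import Counter
from datetime import date
from math import sqrt


def period_key(d, granularity):
    """Map a date to its yearly, quarterly, or monthly bin."""
    if granularity == "yearly":
        return (d.year,)
    if granularity == "quarterly":
        return (d.year, (d.month - 1) // 3 + 1)
    if granularity == "monthly":
        return (d.year, d.month)
    raise ValueError(f"unknown granularity: {granularity}")


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def trend_correlation(actual_dates, predicted_dates, granularity):
    """Correlate per-period counts of actual and predicted events.

    Periods in which neither series has an event are omitted, which is a
    simplification: a real analysis would enumerate every calendar period.
    """
    actual = Counter(period_key(d, granularity) for d in actual_dates)
    predicted = Counter(period_key(d, granularity) for d in predicted_dates)
    periods = sorted(set(actual) | set(predicted))
    return pearson_r([actual[p] for p in periods], [predicted[p] for p in periods])
```

With only 40 events spread over six years, monthly bins hold very small counts, so a single misclassified case shifts a monthly count proportionally far more than a yearly one. That is one plausible reason finer granularity tracks the true trend less well.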

Conclusions: Our proposed method demonstrates that LLMs can effectively detect PEAE from narrative EMRs with high sensitivity and specificity. While these models serve as effective screening tools to exclude non-PEAE cases, their lower PPV indicates they cannot be relied upon solely for definitive PEAE identification. Further chart review remains necessary for confirmation. Future work should focus on improving contextual understanding and medical terminology interpretation, and on exploring advanced prompting techniques to enhance precision in adverse event detection from EMRs.




Source journal

Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks

Journal description: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.