A Prospective Comparison of Large Language Models for Early Prediction of Sepsis.

Supreeth P Shashikumar, Shamim Nemati
Pacific Symposium on Biocomputing, vol. 30, pp. 109-120. Published 2025-01-01. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11649013/pdf/

Abstract

We present a comparative study of two popular open-source large language models (LLMs) for early prediction of sepsis: Llama-3 8B and Mixtral 8x7B. The primary goal was to determine whether a smaller model could achieve predictive accuracy comparable to that of a significantly larger model for sepsis prediction from clinical data.

Our proposed LLM-based sepsis prediction system, COMPOSER-LLM, enhances the previously published COMPOSER model, which uses structured EHR data to generate hourly sepsis risk scores. The new system adds an LLM-based step that extracts sepsis-related clinical signs and symptoms from unstructured clinical notes. For scores falling within high-uncertainty prediction regions, particularly those near the decision threshold, the system uses the LLM to draw additional clinical context from patient notes, thereby enhancing the model's predictive accuracy in challenging diagnostic scenarios.

A total of 2,074 patient encounters admitted to the Emergency Department at two hospitals within the University of California San Diego Health system were used for model evaluation in this study. Our findings reveal that the Llama-3 8B based system (COMPOSER-LLM_Llama) achieved a sensitivity of 70.3%, positive predictive value (PPV) of 32.5%, F-1 score of 44.4% and false alarms per patient hour (FAPH) of 0.0194, closely matching the performance of the larger Mixtral 8x7B based system (COMPOSER-LLM_Mixtral), which achieved a sensitivity of 72.1%, PPV of 31.9%, F-1 score of 44.2% and FAPH of 0.020. When prospectively evaluated, COMPOSER-LLM_Llama demonstrated similar performance to the COMPOSER-LLM_Mixtral pipeline, with a sensitivity of 68.7%, PPV of 36.6%, F-1 score of 47.7% and FAPH of 0.019, versus a sensitivity of 70.5%, PPV of 36.3%, F-1 score of 47.9% and FAPH of 0.020 for COMPOSER-LLM_Mixtral.

This result indicates that, for extraction of clinical signs and symptoms from unstructured clinical notes to enable early prediction of sepsis, the Llama-3 generation of smaller language models can perform as effectively as, and more efficiently than, larger models. This finding has significant implications for healthcare settings with limited resources.
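
As a reader-side sanity check on the reported figures, the F-1 scores can be recomputed from the stated sensitivity and PPV, since F-1 is the harmonic mean of the two. The sketch below is not part of the COMPOSER-LLM pipeline; the function and variable names are illustrative assumptions, and the inputs are the rounded percentages quoted in the abstract, so small rounding differences from the published F-1 values are expected.

```python
def f1_from_sensitivity_ppv(sensitivity: float, ppv: float) -> float:
    """Return F-1 as the harmonic mean of sensitivity (recall) and PPV (precision).

    Inputs and output share the same unit (here: percent).
    """
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity %, PPV %, reported F-1 %) as quoted in the abstract.
# The dictionary keys are labels chosen for this sketch only.
REPORTED = {
    "Llama-3 8B (retrospective)": (70.3, 32.5, 44.4),
    "Mixtral 8x7B (retrospective)": (72.1, 31.9, 44.2),
    "Llama-3 8B (prospective)": (68.7, 36.6, 47.7),
    "Mixtral 8x7B (prospective)": (70.5, 36.3, 47.9),
}


def f1_gaps(reported=REPORTED):
    """Map each label to |recomputed F-1 - reported F-1| in percentage points."""
    return {
        name: abs(f1_from_sensitivity_ppv(sens, ppv) - f1)
        for name, (sens, ppv, f1) in reported.items()
    }


if __name__ == "__main__":
    for name, gap in f1_gaps().items():
        print(f"{name}: gap = {gap:.3f} percentage points")
```

Every recomputed value lands within 0.1 percentage points of the reported F-1, which is consistent with the inputs having been rounded to one decimal place.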
