A pilot feasibility study comparing large language models in extracting key information from ICU patient text records from an Irish population.

Intensive Care Medicine Experimental (IF 2.8, Q2 Critical Care Medicine) · Published: 2024-08-16 · DOI: 10.1186/s40635-024-00656-1
Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas
{"title":"一项试验性可行性研究,比较了大型语言模型从来自爱尔兰人群的重症监护病房病人文本记录中提取关键信息的能力。","authors":"Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas","doi":"10.1186/s40635-024-00656-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.</p><p><strong>Methods: </strong>Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on ability to identify and appropriately order a pre-defined list of important clinical events as well as readability, organisation, succinctness, and overall rank.</p><p><strong>Results: </strong>In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, a summary produced by each LLM from an additional six ICU episodes was utilised for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama2 (p = 0.002). GPT-4 API followed by ChatGPT had the highest score to appropriately order a pre-defined list of important clinical events in the summary as well as readability, organisation, succinctness, and overall rank, whilst Llama2 scored lowest for all. GPT-4 API produced minor hallucinations, which were not present in the other models.</p><p><strong>Conclusion: </strong>Differences exist in large language model performance in readability, organisation, succinctness, and sequencing of clinical events compared to others. All encountered issues with narrative coherence and omitted key clinical data and only moderately captured all clinically meaningful data in the correct order. However, these technologies suggest future potential for creating succinct discharge summaries.</p>","PeriodicalId":13750,"journal":{"name":"Intensive Care Medicine Experimental","volume":"12 1","pages":"71"},"PeriodicalIF":2.8000,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327225/pdf/","citationCount":"0","resultStr":"{\"title\":\"A pilot feasibility study comparing large language models in extracting key information from ICU patient text records from an Irish population.\",\"authors\":\"Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas\",\"doi\":\"10.1186/s40635-024-00656-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. 
In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.</p><p><strong>Methods: </strong>Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on ability to identify and appropriately order a pre-defined list of important clinical events as well as readability, organisation, succinctness, and overall rank.</p><p><strong>Results: </strong>In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, a summary produced by each LLM from an additional six ICU episodes was utilised for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama2 (p = 0.002). GPT-4 API followed by ChatGPT had the highest score to appropriately order a pre-defined list of important clinical events in the summary as well as readability, organisation, succinctness, and overall rank, whilst Llama2 scored lowest for all. GPT-4 API produced minor hallucinations, which were not present in the other models.</p><p><strong>Conclusion: </strong>Differences exist in large language model performance in readability, organisation, succinctness, and sequencing of clinical events compared to others. All encountered issues with narrative coherence and omitted key clinical data and only moderately captured all clinically meaningful data in the correct order. However, these technologies suggest future potential for creating succinct discharge summaries.</p>\",\"PeriodicalId\":13750,\"journal\":{\"name\":\"Intensive Care Medicine Experimental\",\"volume\":\"12 1\",\"pages\":\"71\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-08-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327225/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intensive Care Medicine Experimental\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s40635-024-00656-1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CRITICAL CARE MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intensive Care Medicine Experimental","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s40635-024-00656-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CRITICAL CARE MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Background: Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.

Methods: Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on ability to identify and appropriately order a pre-defined list of important clinical events as well as readability, organisation, succinctness, and overall rank.
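
The abstract does not publish the exact prompting structure used. As a minimal illustrative sketch only (assuming the OpenAI chat-completions Python client for the "GPT-4 API" arm, with a hypothetical system prompt and event list that are not the study's actual wording), a summary request for one anonymised ICU episode might look like this:

```python
# Hypothetical sketch of prompting an LLM for a concise ICU clinical summary.
# Assumptions: the OpenAI chat-completions API (GPT-4 arm); the prompt wording
# and example events are invented for illustration, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarise_icu_episode(anonymised_notes: str) -> str:
    """Request a succinct, chronologically ordered summary of one ICU episode."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an intensive care physician. Summarise the ICU "
                    "episode below into a concise discharge summary. List the "
                    "important clinical events (e.g. intubation, vasopressor "
                    "start, renal replacement therapy) in chronological order. "
                    "Do not add information that is not in the notes."
                ),
            },
            {"role": "user", "content": anonymised_notes},
        ],
        temperature=0,  # favour deterministic, conservative output
    )
    return response.choices[0].message.content

# Example use with a fictitious anonymised note:
# print(summarise_icu_episode("Day 1: admitted with septic shock ..."))
```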

Results: In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, a summary produced by each LLM from an additional six ICU episodes was utilised for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama2 (p = 0.002). GPT-4 API followed by ChatGPT had the highest score to appropriately order a pre-defined list of important clinical events in the summary as well as readability, organisation, succinctness, and overall rank, whilst Llama2 scored lowest for all. GPT-4 API produced minor hallucinations, which were not present in the other models.
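
The abstract reports per-model means ± SD and a single p value (0.002) but does not name the statistical test. The sketch below shows one way per-episode identification scores (percentage of pre-defined events captured) could be summarised and compared across the three models; the scores are invented placeholders for the six test episodes, and the Kruskal-Wallis test is assumed purely for illustration:

```python
# Sketch of summarising and comparing event-identification scores per model.
# The score values are invented placeholders; the abstract does not state
# which test produced p = 0.002, so Kruskal-Wallis is an assumption.
import numpy as np
from scipy import stats

scores = {
    "GPT-4 API": np.array([55.0, 40.0, 30.0, 48.0, 35.0, 41.0]),  # hypothetical %
    "ChatGPT":   np.array([10.0, 45.0, 5.0, 25.0, 12.0, 18.0]),
    "Llama 2":   np.array([8.0, 30.0, 5.0, 20.0, 15.0, 21.0]),
}

# Mean ± sample standard deviation per model, as in the abstract's reporting style.
for model, values in scores.items():
    print(f"{model}: {values.mean():.1f} ± {values.std(ddof=1):.1f}%")

# One-way comparison across the three models (choice of test is an assumption).
h_statistic, p_value = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_statistic:.2f}, p = {p_value:.3f}")
```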

Conclusion: Large language models differ from one another in readability, organisation, succinctness, and sequencing of clinical events. All models encountered issues with narrative coherence, omitted key clinical data, and only moderately captured all clinically meaningful data in the correct order. However, these technologies suggest future potential for creating succinct discharge summaries.

Source journal: Intensive Care Medicine Experimental (Critical Care Medicine)
CiteScore: 5.10
Self-citation rate: 2.90%
Articles published: 48
Review time: 13 weeks
Latest articles in this journal:
- Predictors of intradialytic hypotension in critically ill patients undergoing kidney replacement therapy: a systematic review.
- Is passive leg raising clinically useful in predicting intradialytic hypotension?
- Largely ignored-but pathogenetically significant: ambient temperature in rodent sepsis models.
- The development of a C5.0 machine learning model in a limited data set to predict early mortality in patients with ARDS undergoing an initial session of prone positioning.
- A new method to predict return of spontaneous circulation by peripheral intravenous analysis during cardiopulmonary resuscitation: a rat model pilot study.