Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas
{"title":"一项试验性可行性研究,比较了大型语言模型从来自爱尔兰人群的重症监护病房病人文本记录中提取关键信息的能力。","authors":"Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas","doi":"10.1186/s40635-024-00656-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.</p><p><strong>Methods: </strong>Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on ability to identify and appropriately order a pre-defined list of important clinical events as well as readability, organisation, succinctness, and overall rank.</p><p><strong>Results: </strong>In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, a summary produced by each LLM from an additional six ICU episodes was utilised for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama2 (p = 0.002). GPT-4 API followed by ChatGPT had the highest score to appropriately order a pre-defined list of important clinical events in the summary as well as readability, organisation, succinctness, and overall rank, whilst Llama2 scored lowest for all. GPT-4 API produced minor hallucinations, which were not present in the other models.</p><p><strong>Conclusion: </strong>Differences exist in large language model performance in readability, organisation, succinctness, and sequencing of clinical events compared to others. All encountered issues with narrative coherence and omitted key clinical data and only moderately captured all clinically meaningful data in the correct order. However, these technologies suggest future potential for creating succinct discharge summaries.</p>","PeriodicalId":13750,"journal":{"name":"Intensive Care Medicine Experimental","volume":"12 1","pages":"71"},"PeriodicalIF":2.8000,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327225/pdf/","citationCount":"0","resultStr":"{\"title\":\"A pilot feasibility study comparing large language models in extracting key information from ICU patient text records from an Irish population.\",\"authors\":\"Emma Urquhart, John Ryan, Sean Hartigan, Ciprian Nita, Ciara Hanley, Peter Moran, John Bates, Rachel Jooste, Conor Judge, John G Laffey, Michael G Madden, Bairbre A McNicholas\",\"doi\":\"10.1186/s40635-024-00656-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. 
In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.</p><p><strong>Methods: </strong>Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on ability to identify and appropriately order a pre-defined list of important clinical events as well as readability, organisation, succinctness, and overall rank.</p><p><strong>Results: </strong>In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, a summary produced by each LLM from an additional six ICU episodes was utilised for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama2 (p = 0.002). GPT-4 API followed by ChatGPT had the highest score to appropriately order a pre-defined list of important clinical events in the summary as well as readability, organisation, succinctness, and overall rank, whilst Llama2 scored lowest for all. GPT-4 API produced minor hallucinations, which were not present in the other models.</p><p><strong>Conclusion: </strong>Differences exist in large language model performance in readability, organisation, succinctness, and sequencing of clinical events compared to others. All encountered issues with narrative coherence and omitted key clinical data and only moderately captured all clinically meaningful data in the correct order. However, these technologies suggest future potential for creating succinct discharge summaries.</p>\",\"PeriodicalId\":13750,\"journal\":{\"name\":\"Intensive Care Medicine Experimental\",\"volume\":\"12 1\",\"pages\":\"71\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-08-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11327225/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intensive Care Medicine Experimental\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1186/s40635-024-00656-1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CRITICAL CARE MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intensive Care Medicine Experimental","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s40635-024-00656-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CRITICAL CARE MEDICINE","Score":null,"Total":0}
A pilot feasibility study comparing large language models in extracting key information from ICU patient text records from an Irish population.
Background: Artificial intelligence, through improved data management and automated summarisation, has the potential to enhance intensive care unit (ICU) care. Large language models (LLMs) can interrogate and summarise large volumes of medical notes to create succinct discharge summaries. In this study, we aim to investigate the potential of LLMs to accurately and concisely synthesise ICU discharge summaries.
Methods: Anonymised clinical notes from ICU admissions were used to train and validate a prompting structure in three separate LLMs (ChatGPT, GPT-4 API and Llama 2) to generate concise clinical summaries. Summaries were adjudicated by staff intensivists on their ability to identify and appropriately order a pre-defined list of important clinical events, as well as on readability, organisation, succinctness, and overall rank.
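The abstract does not reproduce the study's prompting structure, so the following is only a hedged sketch of how a fixed summarisation prompt could be sent to one of the three models studied (here GPT-4 via the OpenAI Python SDK). The prompt wording, model name, and temperature are illustrative assumptions, not the authors' settings, and only fully anonymised notes should ever be passed to an external API.

```python
# Illustrative sketch only: the study's actual prompts are not published in
# this abstract, so the prompt text and parameters here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an intensive care physician. Summarise the ICU notes below "
    "into a concise discharge summary, listing the important clinical "
    "events in chronological order."
)

def summarise_icu_episode(anonymised_notes: str, model: str = "gpt-4") -> str:
    """Return a concise clinical summary for one anonymised ICU episode."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": anonymised_notes},
        ],
        temperature=0,  # deterministic output eases side-by-side adjudication
    )
    return response.choices[0].message.content
```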
Results: In the development phase, text from five ICU episodes was used to develop a series of prompts to best capture clinical summaries. In the testing phase, summaries produced by each LLM from an additional six ICU episodes were used for evaluation. Overall ability to identify a pre-defined list of important clinical events in the summary was 41.5 ± 15.2% for GPT-4 API, 19.2 ± 20.9% for ChatGPT and 16.5 ± 14.1% for Llama 2 (p = 0.002). GPT-4 API scored highest, followed by ChatGPT, on appropriately ordering the pre-defined list of important clinical events in the summary and on readability, organisation, succinctness, and overall rank, whilst Llama 2 scored lowest on all measures. GPT-4 API produced minor hallucinations, which were not present in the other models.
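The abstract reports each model's identification rate as mean ± standard deviation with p = 0.002 but does not name the statistical test used. Purely as a hedged sketch, per-episode percentages could be summarised and compared as below; the scores are invented placeholders, and the one-way ANOVA is one plausible test rather than necessarily the authors' choice.

```python
# Hypothetical per-episode percentages of pre-defined clinical events
# identified by each model (placeholder numbers, not the study's data).
import numpy as np
from scipy import stats

scores = {
    "GPT-4 API": [55.0, 40.0, 30.0, 60.0, 35.0, 29.0],
    "ChatGPT":   [10.0, 50.0,  5.0, 25.0,  0.0, 25.0],
    "Llama 2":   [20.0,  5.0, 30.0, 10.0,  0.0, 34.0],
}

for model, vals in scores.items():
    arr = np.asarray(vals)
    print(f"{model}: {arr.mean():.1f} ± {arr.std(ddof=1):.1f}%")

# The abstract does not state which test yielded p = 0.002; a one-way
# ANOVA across the three models is one plausible choice.
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```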
Conclusion: The large language models differed in readability, organisation, succinctness, and sequencing of clinical events. All models had issues with narrative coherence, omitted key clinical data, and only moderately captured all clinically meaningful data in the correct order. Nevertheless, these technologies show future potential for creating succinct discharge summaries.