Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models
David B Larson, Arogya Koirala, Lina Y Cheuy, Magdalini Paschali, Dave Van Veen, Hye Sun Na, Matthew B Petterson, Zhongnan Fang, Akshay S Chaudhari
Radiology, volume 314, issue 2, e241051. Published 2025-02-01.
DOI: https://doi.org/10.1148/radiol.241051
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11868845/pdf/
Abstract
Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at an adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source (GPT-4 Turbo [OpenAI]) LLMs were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement using Cohen κ (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists with 16 and 3 years of postfellowship experience, respectively, were assessed using accuracy, Cohen κ, and BERTScore, an LLM metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean κ, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, could effectively extract clinical history elements with substantial agreement with radiologists and produce a benchmark for completeness of a large sample of clinical histories. The model and code will be fully open-sourced.

© RSNA, 2025. Supplemental material is available for this article.
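The Cohen κ agreement bands and the five-element completeness criterion described in the abstract can be illustrated with a short Python sketch. This is not the study's code, which the abstract does not show. The function names (`cohen_kappa`, `agreement_band`, `is_complete`), the dictionary layout for extracted elements, and the handling of κ values below 0.01 are assumptions made for this example only.

```python
from collections import Counter

# Element names as quoted in the abstract. Storing them as dictionary keys
# is a hypothetical representation chosen for this sketch.
ELEMENTS = ("past medical history", "what", "when", "where", "clinical concern")

# Agreement bands as listed in the abstract: (upper bound, label).
KAPPA_BANDS = (
    (0.20, "none to slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
)


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map a kappa value to the abstract's bands (two-decimal resolution)."""
    value = round(kappa, 2)
    if value < 0.01:
        # Assumption: the abstract lists no band below 0.01.
        return "below the listed bands"
    for upper, label in KAPPA_BANDS:
        if value <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


def is_complete(extracted):
    """True when every one of the five elements has non-empty text."""
    return all(str(extracted.get(name) or "").strip() for name in ELEMENTS)


if __name__ == "__main__":
    # Toy labels (1 = element present, 0 = absent); not study data.
    kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
    print(kappa, agreement_band(kappa))  # 0.5 moderate
    # Completeness rate reported in the abstract: 12 803 of 48 942.
    print(round(100 * 12803 / 48942, 1))  # 26.2
```

Under these bands, the reported mean κ values of 0.73 and 0.77 both fall in the "substantial" range, which matches the wording of the Results.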