Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

Academic Radiology · IF 3.8 · JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging) · CAS Tier 2 (Medicine) · Pub Date: 2024-09-30 · DOI: 10.1016/j.acra.2024.09.041
Elif Can, Wibke Uller, Katharina Vogt, Michael C Doppler, Felix Busch, Nadine Bayerl, Stephan Ellmann, Avan Kader, Aboelyazid Elkilany, Marcus R Makowski, Keno K Bressem, Lisa C Adams

Abstract

Purpose: To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology reports.

Methods: Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.
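As an illustration of the two Flesch metrics named above (not the authors' implementation; published studies typically use validated tools such as the `textstat` package), here is a minimal sketch of Flesch Reading Ease and Flesch-Kincaid Grade Level, using a rough vowel-group heuristic for syllable counting:

```python
import re


def count_syllables(word: str) -> int:
    """Approximate English syllable count via vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a likely silent final 'e'
    return max(n, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level).

    FRE  = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    Higher FRE means easier text; FKGL approximates a US school grade.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # words per sentence
    spw = syllables / len(words)       # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

A median FRE near 69, as reported for GPT-4 below, falls in the "plain English" band (roughly 8th-9th grade), whereas clinical source reports typically score far lower.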

Results: Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).
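The "Bonferroni-corrected p-values: p = 1" above reflect how the correction works: each raw p-value is multiplied by the number of comparisons and capped at 1.0, so comparisons far from significance all report exactly 1 after correction. A minimal sketch (illustrative only, not the study's analysis code):

```python
def bonferroni_correct(p_values: list[float], alpha: float = 0.05):
    """Bonferroni correction: scale each raw p-value by the number of
    comparisons, capping at 1.0; test against the original alpha."""
    m = len(p_values)
    corrected = [min(p * m, 1.0) for p in p_values]
    significant = [p < alpha for p in corrected]
    return corrected, significant
```

With three comparisons, a raw p of 0.5 becomes min(1.5, 1.0) = 1.0, matching the capped values reported for the GPT-4 vs. Claude-3-Opus contrasts.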

Conclusions: GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.

Clinical relevance/applications: With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.

Source journal: Academic Radiology (Medicine / Nuclear Medicine)
CiteScore: 7.60
Self-citation rate: 10.40%
Annual articles: 432
Review time: 18 days
Journal description: Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.