Large language models in radiology: Fluctuating performance and decreasing discordance over time

IF 3.2 · CAS Medicine Tier 3 · Q1 Radiology, Nuclear Medicine & Medical Imaging · European Journal of Radiology · Pub Date: 2024-11-20 · DOI: 10.1016/j.ejrad.2024.111842
Mitul Gupta, John Virostko, Christopher Kaufmann
{"title":"Large language models in radiology: Fluctuating performance and decreasing discordance over time","authors":"Mitul Gupta ,&nbsp;John Virostko ,&nbsp;Christopher Kaufmann","doi":"10.1016/j.ejrad.2024.111842","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>Since the introduction of large language models (LLMs), near expert level performance in medical specialties such as radiology has been demonstrated. However, there is limited to no comparative information of model performance, accuracy, and reliability over time in these medical specialty domains. This study aims to evaluate and monitor the performance and internal reliability of LLMs in radiology over a three-month period.</div></div><div><h3>Methods</h3><div>LLMs (GPT-4, GPT-3.5, Claude, and Google Bard) were queried monthly from November 2023 to January 2024, utilizing ACR Diagnostic in Training Exam (DXIT) practice questions. Model overall accuracy and by subspecialty category was assessed over time. Internal consistency was evaluated through answer mismatch or intra-model discordance between trials.</div></div><div><h3>Results</h3><div>GPT-4 had the highest accuracy (78 ± 4.1 %), followed by Google Bard (73 ± 2.9 %), Claude (71 ± 1.5 %), and GPT-3.5 (63 ± 6.9 %). GPT-4 performed significantly better than GPT-3.5 (p = 0.031). Over time, GPT-4′s accuracy trended down (82 % to 74 %), while Claude’s accuracy increased (70 % to 73 %). Intra-model discordance rates decreased for all models, indicating improved response consistency. Performance varied by subspecialty, with significant differences in the Chest, Physics, Ultrasound, and Pediatrics sections. Models struggled with questions requiring detailed factual knowledge but performed better on broader interpretive questions.</div></div><div><h3>Conclusion</h3><div>LLMs, except GPT-3.5, performed above 70%, demonstrating substantial subject-specific knowledge. However, performance fluctuated over time, underscoring the need for continuous, radiology-specific standardized benchmarking metrics to gauge LLM reliability before clinical use. This study provides a foundational benchmark for future LLM performance evaluations in radiology.</div></div>","PeriodicalId":12063,"journal":{"name":"European Journal of Radiology","volume":"182 ","pages":"Article 111842"},"PeriodicalIF":3.2000,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Radiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0720048X24005588","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
引用次数: 0

Abstract

Objective

Since the introduction of large language models (LLMs), they have demonstrated near expert-level performance in medical specialties such as radiology. However, there is little to no comparative information on model performance, accuracy, and reliability over time in these specialty domains. This study aims to evaluate and monitor the performance and internal reliability of LLMs in radiology over a three-month period.

Methods

Four LLMs (GPT-4, GPT-3.5, Claude, and Google Bard) were queried monthly from November 2023 to January 2024 using ACR Diagnostic In-Training Exam (DXIT) practice questions. Model accuracy, both overall and by subspecialty category, was assessed over time. Internal consistency was evaluated through answer mismatch, i.e., intra-model discordance between trials.
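The abstract does not publish the study's scoring pipeline. The following is a minimal sketch, assuming answer logs keyed by model, month, and question, of how overall accuracy and intra-model discordance (answer mismatch between monthly trials) could be computed; the variable names and toy data are illustrative only, not the authors' code.

```python
# Minimal sketch (not the authors' code): accuracy and intra-model
# discordance computed from hypothetical monthly answer logs.

# Toy data: responses[model][month][question_id] = chosen option.
answer_key = {"q1": "A", "q2": "C", "q3": "B"}
responses = {
    "model_x": {
        "2023-11": {"q1": "A", "q2": "C", "q3": "D"},
        "2023-12": {"q1": "A", "q2": "B", "q3": "B"},
        "2024-01": {"q1": "A", "q2": "C", "q3": "B"},
    },
}

def accuracy(trial, key):
    """Fraction of questions answered correctly in one monthly trial."""
    return sum(trial[q] == a for q, a in key.items()) / len(key)

def discordance(trial_a, trial_b):
    """Fraction of shared questions where the model changed its answer
    between two trials (answer mismatch), regardless of correctness."""
    shared = trial_a.keys() & trial_b.keys()
    return sum(trial_a[q] != trial_b[q] for q in shared) / len(shared)

months = ["2023-11", "2023-12", "2024-01"]
for model, by_month in responses.items():
    accs = [round(accuracy(by_month[m], answer_key), 2) for m in months]
    drift = [round(discordance(by_month[a], by_month[b]), 2)
             for a, b in zip(months, months[1:])]
    print(model, "monthly accuracy:", accs, "month-to-month discordance:", drift)
```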

Results

GPT-4 had the highest accuracy (78 ± 4.1%), followed by Google Bard (73 ± 2.9%), Claude (71 ± 1.5%), and GPT-3.5 (63 ± 6.9%). GPT-4 performed significantly better than GPT-3.5 (p = 0.031). Over time, GPT-4's accuracy trended down (82% to 74%), while Claude's accuracy increased (70% to 73%). Intra-model discordance rates decreased for all models, indicating improved response consistency. Performance varied by subspecialty, with significant differences in the Chest, Physics, Ultrasound, and Pediatrics sections. Models struggled with questions requiring detailed factual knowledge but performed better on broader interpretive questions.
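The abstract reports the GPT-4 versus GPT-3.5 comparison as significant (p = 0.031) but does not state which statistical test was used. The sketch below shows only one common way to compare two models' pooled correct/incorrect counts, a chi-squared test of independence; the counts are hypothetical and not taken from the study.

```python
from scipy.stats import chi2_contingency

# Hypothetical pooled counts of correct vs. incorrect answers per model.
# These numbers are illustrative only and do not come from the study.
gpt4 = {"correct": 78, "incorrect": 22}
gpt35 = {"correct": 63, "incorrect": 37}

table = [[gpt4["correct"], gpt4["incorrect"]],
         [gpt35["correct"], gpt35["incorrect"]]]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```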

Conclusion

All LLMs except GPT-3.5 scored above 70% accuracy, demonstrating substantial subject-specific knowledge. However, performance fluctuated over time, underscoring the need for continuous, radiology-specific standardized benchmarking metrics to gauge LLM reliability before clinical use. This study provides a foundational benchmark for future LLM performance evaluations in radiology.
Source journal metrics: CiteScore 6.70 · Self-citation rate 3.00% · Articles per year 398 · Review turnaround 42 days
About the journal: European Journal of Radiology is an international journal which aims to communicate to its readers state-of-the-art information on imaging developments in the form of high-quality original research articles and timely reviews on current developments in the field. Its audience includes clinicians at all levels of training, including radiology trainees, newly qualified imaging specialists, and the experienced radiologist. Its aim is to inform efficient, appropriate, and evidence-based imaging practice to the benefit of patients worldwide.