Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs.

Impact Factor 12.1 · CAS Tier 1 (Medicine) · Q1, Radiology, Nuclear Medicine & Medical Imaging · Radiology · Publication date: 2024-12-01 · DOI: 10.1148/radiol.241668
Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Kye Jin Park, Pyeong Hwa Kim, Se Jin Choi, Yura Ahn, Sohee Park, Ho Young Park, Na Eun Oh, Min Woo Han, Sung Tan Cho, Chang-Yun Woo, Hyungjun Park
{"title":"Comparing Large Language Model and Human Reader Accuracy with <i>New England Journal of Medicine</i> Image Challenge Case Image Inputs.","authors":"Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Kye Jin Park, Pyeong Hwa Kim, Se Jin Choi, Yura Ahn, Sohee Park, Ho Young Park, Na Eun Oh, Min Woo Han, Sung Tan Cho, Chang-Yun Woo, Hyungjun Park","doi":"10.1148/radiol.241668","DOIUrl":null,"url":null,"abstract":"<p><p>Background Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images is still doubted. Purpose To evaluate the accuracy of LLMs and compare it with that of human readers with varying levels of experience and to assess the factors affecting LLM accuracy in answering <i>New England Journal of Medicine</i> Image Challenge cases. Materials and Methods Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (Open AI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google's DeepMind Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with and without image inputs and short (cases from 2005 to 2015) versus long text inputs (from 2016 to 2024) was evaluated in subgroup analysis to determine the effect of these factors. Factor analysis was assessed using multivariable logistic regression. Accuracy was compared with generalized estimating equations, with multiple comparisons adjusted by using Bonferroni correction. Results A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among LLMs (59.6%; 162 of 272), outperforming a medical student (47.1%; 128 of 272; <i>P</i> < .001) but not junior faculty (80.9%; 220 of 272; <i>P</i> < .001) or the in-training radiologist (70.2%; 191 of 272; <i>P</i> = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; <i>P</i> = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all <i>P</i> < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]). Conclusion LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text lengths, regardless of image input. © RSNA, 2024 <i>Supplemental material is available for this article.</i></p>","PeriodicalId":20896,"journal":{"name":"Radiology","volume":"313 3","pages":"e241668"},"PeriodicalIF":12.1000,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1148/radiol.241668","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract

Background: Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images remains in doubt.

Purpose: To evaluate the accuracy of LLMs, compare it with that of human readers with varying levels of experience, and assess the factors affecting LLM accuracy in answering New England Journal of Medicine Image Challenge cases.

Materials and Methods: Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (OpenAI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google DeepMind's Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. LLM accuracy with versus without image inputs and with short (cases from 2005 to 2015) versus long (cases from 2016 to 2024) text inputs was evaluated in subgroup analyses to determine the effect of these factors. Factors affecting LLM accuracy were assessed with multivariable logistic regression. Accuracy was compared using generalized estimating equations, with multiple comparisons adjusted by Bonferroni correction.

Results: A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among the LLMs (59.6%; 162 of 272), outperforming the medical student (47.1%; 128 of 272; P < .001) but not the junior faculty (80.9%; 220 of 272; P < .001) or the in-training radiologist (70.2%; 191 of 272; P = .003). GPT-4o exhibited similar accuracy regardless of image input (54.0% [147 of 272] without images vs 59.6% [162 of 272] with images; P = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all P < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]).

Conclusion: LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text lengths, regardless of image input.

© RSNA, 2024. Supplemental material is available for this article.
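The study's core measurement, whether a multimodal LLM answers an image-plus-text case correctly, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of posing one Image Challenge-style multiple-choice case to GPT-4o through the OpenAI Python SDK; the vignette, image path, answer options, and scoring logic are placeholder assumptions, not the authors' actual pipeline, and equivalent calls would be needed for Gemini 1.5 Pro and Claude 3.

```python
# Hypothetical sketch: send one multiple-choice image case to GPT-4o and score the reply.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the vignette, image file, and options below are placeholders, not study data.
import base64
from openai import OpenAI

client = OpenAI()

def ask_case(vignette: str, options: list[str], image_path: str) -> str:
    """Send the clinical vignette plus the case image; return the model's reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"{vignette}\n\nChoose the single best answer:\n"
        + "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
        + "\nReply with the number of your answer only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Placeholder usage: accuracy over a case set is the fraction of correct replies.
reply = ask_case(
    vignette="A 60-year-old presents with acute flank pain.",  # placeholder vignette
    options=["Renal infarct", "Pyelonephritis", "Nephrolithiasis", "Renal abscess"],
    image_path="case_001.jpg",
)
is_correct = reply.startswith("3")  # placeholder ground-truth answer index
```

The accuracy comparison described in the methods (generalized estimating equations with Bonferroni-adjusted comparisons) could likewise be sketched as below, assuming a long-format table of per-case correctness; the file name, column names, and reference level are illustrative, not taken from the study.

```python
# Hypothetical sketch of the reported analysis style: per-case binary correctness,
# clustered by case, compared across answer sources with GEE and Bonferroni adjustment.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed layout: one row per (case, source) pair with columns "case", "source", "correct" (0/1).
df = pd.read_csv("responses.csv")  # placeholder file

model = smf.gee(
    "correct ~ C(source, Treatment(reference='GPT-4o'))",
    groups="case",
    data=df,
    family=sm.families.Binomial(),
)
fit = model.fit()

# Bonferroni adjustment over the pairwise contrasts against the reference source.
n_comparisons = df["source"].nunique() - 1
adjusted_p = (fit.pvalues * n_comparisons).clip(upper=1.0)
print(adjusted_p)
```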

Source Journal
Radiology (Medicine – Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Annual publication volume: 596
Average review time: 3.6 months
Journal description: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest-quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.
Latest articles in this journal:
A Leadership Primer.
COVID-19 Infection and Coronary Plaque Progression: An Early Warning of a Potential Public Health Crisis.
Advancing Care: Managing Small Late-Recurrence Hepatocellular Carcinoma with Image-guided Therapy.
AI-generated Clinical Histories for Radiology Reports: Closing the Information Gap.
CT Honeycombing and Traction Bronchiectasis Extent Independently Predict Survival across Fibrotic Interstitial Lung Disease Subtypes.