Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases.

Radiology (IF 12.1; Zone 1, Medicine; JCR Q1, Radiology, Nuclear Medicine & Medical Imaging). Pub Date: 2024-07-01. DOI: 10.1148/radiol.240273
Pae Sun Suh, Woo Hyun Shim, Chong Hyun Suh, Hwon Heo, Chae Ri Park, Hye Joung Eom, Kye Jin Park, Jooae Choe, Pyeong Hwa Kim, Hyo Jung Park, Yura Ahn, Ho Young Park, Yoonseok Choi, Chang-Yun Woo, Hyungjun Park

Abstract

Background: The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored.

Purpose: To investigate the ability of GPT-4V and Gemini Pro Vision in generating differential diagnoses at different temperatures compared with radiologists using Radiology Diagnosis Please cases.

Materials and Methods: This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists.

Results: A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04). The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties.

Conclusion: Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making.

© RSNA, 2024. See also the editorial by Nishino and Ballard in this issue.
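The arithmetic behind the reported figures can be rechecked with a short script. The following is a minimal sketch, not the authors' analysis code. It makes three assumptions that the abstract does not state: a family-wise alpha of 0.05, seven comparisons (two LLMs at three temperatures plus the radiologists, which is consistent with the reported threshold of P < .007), and exact string matching as a stand-in for the radiologist's manual comparison of generated and final diagnoses. All function and variable names are hypothetical.

```python
"""Recheck summary figures quoted in the abstract (illustrative sketch only)."""

N_CASES = 190
ALPHA = 0.05        # assumption: conventional family-wise error rate
N_COMPARISONS = 7   # assumption: 2 LLMs x 3 temperatures + radiologists

# Case counts per subspecialty, as reported in the abstract.
SUBSPECIALTY_COUNTS = {
    "neuroradiology": 53, "multisystem": 27, "gastrointestinal": 25,
    "genitourinary": 23, "musculoskeletal": 17, "chest": 16,
    "cardiovascular": 12, "pediatric": 12, "breast": 5,
}

# Correctly diagnosed cases out of 190, as reported in the abstract.
CORRECT_COUNTS = {
    "GPT-4V T0": 78, "GPT-4V T0.5": 86, "GPT-4V T1": 93,
    "Gemini Pro Vision T0": 55, "Gemini Pro Vision T0.5": 69,
    "Gemini Pro Vision T1": 74, "Radiologists": 115,
}


def accuracy_percent(correct, total=N_CASES):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


def bonferroni_threshold(alpha=ALPHA, n_comparisons=N_COMPARISONS):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


def case_correct(repetitions, final_diagnosis):
    """True if any repetition's differential list contains the final diagnosis.

    Exact (case-insensitive) string matching is a simplification: in the
    study, an experienced radiologist made this judgment.
    """
    target = final_diagnosis.strip().lower()
    return any(
        target in (d.strip().lower() for d in differentials)
        for differentials in repetitions
    )


if __name__ == "__main__":
    print("total cases:", sum(SUBSPECIALTY_COUNTS.values()))
    print("threshold:", round(bonferroni_threshold(), 4))
    for name, correct in CORRECT_COUNTS.items():
        print(f"{name}: {accuracy_percent(correct)}% ({correct} of {N_CASES})")
```

Under these assumptions, a reported P value such as .02 or .04 falls below 0.05 but above the corrected threshold, which is why the abstract describes those differences as not statistically significant after Bonferroni adjustment.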

Source journal: Radiology (Medicine, Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Articles published: 596
Review time: 3.6 months
About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant and highest quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.