Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases.

Radiology | IF 12.1 | Zone 1, Medicine | JCR Q1: Radiology, Nuclear Medicine & Medical Imaging | Pub Date: 2025-01-01 | DOI: 10.1148/radiol.240689
Severin Schramm, Silas Preis, Marie-Christin Metz, Kirsten Jung, Benita Schmitz-Koep, Claus Zimmer, Benedikt Wiestler, Dennis M Hedderich, Su Hwan Kim
Radiology 2025; 314(1): e240689. Publication type: Journal Article.

Abstract

Background: Studies have explored the application of multimodal large language models (LLMs) in radiologic differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.

Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of OpenAI's GPT-4 with vision (GPT-4V)-based brain MRI differential diagnosis.

Materials and Methods: Sixty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image without modifiers [I], annotation [A], medical history [H], and image description [D]) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (Perplexity AI, powered by GPT-4V). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a χ² test and a Kruskal-Wallis test. Results were corrected for false-discovery rate with use of the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each input element to diagnostic performance.

Results: The prompt group containing I, A, H, and D as input exhibited the highest diagnostic accuracy (124 of 180 responses [69%]). Significant differences were observed between prompt groups that contained D among their inputs and those that did not. Unannotated (I) (four of 180 responses [2.2%]) or annotated radiologic images alone (I and A) (two of 180 responses [1.1%]) yielded very low diagnostic accuracy. Regression analyses confirmed a large positive effect of D on diagnostic accuracy (odds ratio [OR], 68.03; P < .001), as well as a moderate positive effect of H (OR, 4.18; P < .001).

Conclusion: The textual description of radiologic image findings was identified as the strongest contributor to the performance of GPT-4V in brain MRI differential diagnosis, followed by the medical history; unannotated or annotated images alone yielded very low diagnostic performance.

© RSNA, 2025. Supplemental material is available for this article.
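
The arithmetic behind the reported figures, and the false-discovery-rate correction named in the methods, can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names `accuracy_percent` and `benjamini_hochberg` are hypothetical, only the three response counts stated in the abstract are used, and the p values in the example are invented purely to show how the Benjamini-Hochberg adjustment behaves.

```python
# Figures stated in the abstract: 60 cases, 3 identical queries per case
# and prompt group, hence 180 responses per prompt group.
N_CASES = 60
QUERIES_PER_CASE = 3
RESPONSES_PER_GROUP = N_CASES * QUERIES_PER_CASE

# Correct responses reported for three of the seven prompt groups.
# The abstract does not give counts for the other four groups.
REPORTED_CORRECT = {"I+A+H+D": 124, "I": 4, "I+A": 2}


def accuracy_percent(correct, total=RESPONSES_PER_GROUP):
    """Return diagnostic accuracy as a percentage of all responses."""
    return 100.0 * correct / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    for group, correct in REPORTED_CORRECT.items():
        print(f"{group}: {correct}/{RESPONSES_PER_GROUP} "
              f"= {accuracy_percent(correct):.1f}%")
    # Hypothetical p values, NOT taken from the study.
    example_p = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(example_p))
```

In practice an established implementation would usually be preferred over a hand-written one; the pure-Python version here only makes the ranking and monotonicity steps of the procedure visible.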

Source journal: Radiology (category: Medicine, Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Articles published: 596
Review time: 3.6 months
About the journal: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant and highest quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.