Impact of Multimodal Prompt Elements on Diagnostic Performance of GPT-4V in Challenging Brain MRI Cases.
Severin Schramm, Silas Preis, Marie-Christin Metz, Kirsten Jung, Benita Schmitz-Koep, Claus Zimmer, Benedikt Wiestler, Dennis M Hedderich, Su Hwan Kim
Radiology, volume 314, issue 1, e240689. Published 2025-01-01. DOI: 10.1148/radiol.240689 (https://doi.org/10.1148/radiol.240689)
Abstract
Background: Studies have explored the application of multimodal large language models (LLMs) in radiologic differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.

Purpose: To evaluate the impact of varying multimodal input elements on the accuracy of OpenAI's GPT-4 with vision (GPT-4V)-based brain MRI differential diagnosis.

Materials and Methods: Sixty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image without modifiers [I], annotation [A], medical history [H], and image description [D]) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (Perplexity AI, powered by GPT-4V). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a χ² test and a Kruskal-Wallis test. Results were corrected for false-discovery rate with use of the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each input element to diagnostic performance.

Results: The prompt group containing I, A, H, and D as input exhibited the highest diagnostic accuracy (124 of 180 responses [69%]). Significant differences were observed between prompt groups that contained D among their inputs and those that did not. Unannotated (I) (four of 180 responses [2.2%]) or annotated radiologic images alone (I and A) (two of 180 responses [1.1%]) yielded very low diagnostic accuracy. Regression analyses confirmed a large positive effect of D on diagnostic accuracy (odds ratio [OR], 68.03; P < .001), as well as a moderate positive effect of H (OR, 4.18; P < .001).

Conclusion: The textual description of radiologic image findings was identified as the strongest contributor to the performance of GPT-4V in brain MRI differential diagnosis, followed by the medical history; unannotated or annotated images alone yielded very low diagnostic performance.

© RSNA, 2025
Supplemental material is available for this article.
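The arithmetic behind the reported figures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. It uses only the three correct-response counts given in the abstract (the other four prompt groups are not reported here). The function names, the choice of pairwise 2×2 comparisons, and the uncorrected Pearson χ² statistic are all assumptions made for illustration. The sketch also treats the 180 responses per group as independent, which ignores that each case was queried three times.

```python
import math
from itertools import combinations

N_CASES = 60
QUERIES_PER_CASE = 3
RESPONSES_PER_GROUP = N_CASES * QUERIES_PER_CASE  # 60 cases x 3 queries = 180

# Correct-response counts stated in the abstract (three of the seven prompt groups).
REPORTED_CORRECT = {"I+A+H+D": 124, "I": 4, "I+A": 2}


def accuracy(correct, total=RESPONSES_PER_GROUP):
    """Fraction of responses rated correct under the binary score."""
    return correct / total


def chi2_two_groups(correct_a, correct_b, total=RESPONSES_PER_GROUP):
    """Pearson chi-square (no continuity correction) for two equal-sized groups.

    Returns (statistic, p_value). With 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(x / 2)), so no external library is needed.
    """
    observed = [
        [correct_a, total - correct_a],
        [correct_b, total - correct_b],
    ]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    n = 2 * total
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = total * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    for group, correct in REPORTED_CORRECT.items():
        print(f"{group}: {correct}/{RESPONSES_PER_GROUP} = {accuracy(correct):.1%}")

    # Hypothetical pairwise comparisons among the three reported groups.
    pairs = list(combinations(REPORTED_CORRECT, 2))
    raw = [chi2_two_groups(REPORTED_CORRECT[a], REPORTED_CORRECT[b])[1] for a, b in pairs]
    for (a, b), p_adj in zip(pairs, benjamini_hochberg(raw)):
        print(f"{a} vs {b}: BH-adjusted P = {p_adj:.3g}")
```

The accuracy values reproduce the percentages quoted in the Results. The I versus I+A comparison has expected cell counts below 5, so a χ² approximation is unreliable there and an exact test would be the more cautious choice in practice.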