Role of visual information in multimodal large language model performance: an evaluation using the Japanese nuclear medicine board examination.

IF 2.5 | Zone 4 (Medicine) | JCR Q2 (RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING) | Annals of Nuclear Medicine | Pub Date: 2024-11-13 | DOI: 10.1007/s12149-024-01992-8
Takashi Watanabe, Akira Baba, Takeshi Fukuda, Ken Watanabe, Jun Woo, Hiroya Ojiri
Citations: 0

Abstract

Objectives: This study aimed to assess the performance of state-of-the-art multimodal large language models (LLMs), specifically GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, on Japanese Nuclear Medicine Board Examination (JNMBE) questions and to evaluate the influence of visual information on the decision-making process.

Methods: This study utilized 92 questions with images from the JNMBE (2019-2023). The LLMs' responses were assessed under two conditions: providing both text and images and providing only text. Each model answered all questions thrice, and the most frequent answer choice was considered the final answer. The accuracy and agreement rates among the model answers were evaluated using statistical tests.
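
The answer-aggregation step described in Methods can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `majority_answer` and `same_option_rate` are hypothetical, the toy data are invented, and the abstract does not say how a question was handled when the three runs gave three different answers, so the sketch returns `None` in that case as an assumption.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(runs: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer among repeated runs.

    Assumption: if the top answers are tied (for example three different
    answers in three runs), return None; the abstract does not describe
    how such cases were handled.
    """
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(runs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie at the top, no single most frequent choice
    return counts[0][0]


def same_option_rate(with_images: Sequence[Optional[str]],
                     text_only: Sequence[Optional[str]]) -> float:
    """Fraction of questions whose final answer is identical in both conditions."""
    if len(with_images) != len(text_only) or not with_images:
        raise ValueError("need two non-empty answer lists of equal length")
    same = sum(a == b for a, b in zip(with_images, text_only))
    return same / len(with_images)


# Toy data (invented for illustration, not taken from the study).
image_runs = [["a", "a", "c"], ["b", "d", "b"], ["e", "e", "e"]]
text_runs = [["a", "c", "a"], ["d", "d", "b"], ["e", "a", "e"]]
final_image = [majority_answer(r) for r in image_runs]
final_text = [majority_answer(r) for r in text_runs]
print(final_image, final_text, same_option_rate(final_image, final_text))
```

Applied to the real data, `same_option_rate` would correspond to the share of questions on which a model chose the same option with and without images.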

Results: None of GPT-4o, Claude 3 Opus, or Gemini 1.5 Pro showed a significant difference in accuracy between the text-and-image and text-only conditions. GPT-4o and Claude 3 Opus each achieved an accuracy of 54.3% (95% CI: 44.2%-64.1%) when provided with both text and images; however, they selected the same options as in the text-only condition for 71.7% of the questions. Gemini 1.5 Pro performed significantly worse than GPT-4o under the text-and-image condition. The agreement rates among the model answers ranged from weak to moderate.
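
As a worked check on the reported interval, the sketch below computes a Wilson score interval for a binomial proportion. Two assumptions here are not stated in the abstract: 54.3% of 92 questions is read as 50 correct answers, and the Wilson method is chosen because the abstract does not name the interval method. The helper name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct: int, total: int, confidence: float = 0.95):
    """Wilson score interval for a proportion correct/total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Assumed count: 50 of 92 correct (inferred from the reported 54.3%).
low, high = wilson_interval(50, 92)
print(f"accuracy = {50 / 92:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```

Under these assumptions the interval comes out close to the reported 44.2%-64.1%, which suggests, but does not confirm, that a Wilson-type interval was used.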

Conclusion: The influence of images on decision-making in nuclear medicine is limited even for the latest multimodal LLMs, and their diagnostic ability in this highly specialized field remains insufficient. Improving the utilization of image information and enhancing answer reproducibility are crucial for the effective application of LLMs in nuclear medicine education and practice. Further advancements in these areas are necessary to harness the potential of LLMs as assistants in nuclear medicine diagnosis.
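
Answer reproducibility and the agreement among model answers can be quantified with a chance-corrected statistic. The abstract does not name the statistic that was used; Cohen's kappa for two answer lists is shown here only as one common choice, and `cohen_kappa` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(answers_a: Sequence[str], answers_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two answer lists, corrected for chance."""
    n = len(answers_a)
    if n == 0 or n != len(answers_b):
        raise ValueError("need two non-empty answer lists of equal length")
    # Observed agreement: share of questions with identical answers.
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    # Expected agreement if both lists chose options independently
    # with their own observed option frequencies.
    freq_a = Counter(answers_a)
    freq_b = Counter(answers_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both lists use one identical option")
    return (observed - expected) / (1.0 - expected)


# Toy data (invented): two runs of the same model on four questions.
run_1 = ["a", "a", "b", "b"]
run_2 = ["a", "a", "b", "a"]
print(cohen_kappa(run_1, run_2))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; comparing more than two answer lists at once would need a different statistic, such as Fleiss' kappa.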

Source journal: Annals of Nuclear Medicine (Medicine: Nuclear Medicine)
CiteScore: 4.90
Self-citation rate: 7.70%
Articles published: 111
Review time: 4-8 weeks
About the journal: Annals of Nuclear Medicine is an official journal of the Japanese Society of Nuclear Medicine. It develops the appropriate application of radioactive substances and stable nuclides in the field of medicine. The journal promotes the exchange of ideas, information, and research in nuclear medicine and covers the medical application of radionuclides and related subjects. It presents original articles, short communications, reviews, and letters to the editor.