Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis

Journal of Clinical Neuroscience · Impact Factor 1.8 · CAS Region 4 (Medicine) · JCR Q3 (Clinical Neurology) · Pub Date: 2025-04-01 · Epub Date: 2025-02-11 · DOI: 10.1016/j.jocn.2025.111097
Alana M. McNulty, Harshitha Valluri, Avi A. Gajjar, Amanda Custozzo, Nicholas C. Field, Alexandra R. Paul

Abstract

Introduction

Artificial intelligence (AI) has gained significant attention in medicine, particularly in neurosurgery, where its potential is frequently discussed and occasionally feared. Large language models (LLMs), such as ChatGPT-4.0 (OpenAI) and Gemini (Google DeepMind), have shown promise in text-based tasks but remain underexplored in image-based domains, which are essential for neurosurgery. This study evaluates the performance of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions, focusing on their ability to interpret visual data, a critical aspect of neurosurgical decision-making.

Methods

A total of 250 image-based questions were selected from two neurosurgical review textbooks. Each question was presented to both ChatGPT-4.0 and Gemini in its original format, including images such as MRI scans, pathology slides, and surgical visuals. The models were tasked with answering the questions, and accuracy was determined by the number of correct responses.
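The scoring procedure described above can be sketched as a small Python helper. This is a minimal sketch with hypothetical names; the study does not publish its grading code, and here an "inability response" is represented as `None`:

```python
def score_responses(responses, answer_key):
    """Score one model's answers against the answer key.

    Returns (accuracy, inability_rate). An 'inability response' --
    the model explicitly stating it cannot interpret the image --
    is represented as None in the responses list.
    """
    assert len(responses) == len(answer_key)
    n = len(responses)
    correct = sum(r == k for r, k in zip(responses, answer_key))
    inability = sum(r is None for r in responses)
    return correct / n, inability / n
```

With 250 questions, 84 correct answers yields an accuracy of 33.6%, matching the figure reported below.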

Results

ChatGPT-4.0 correctly answered 84 questions (33.6%), significantly outperforming Gemini, which answered only 1 question correctly (0.4%) (p < 0.0001). ChatGPT-4.0 provided correct answers for 17.7% of questions from The Comprehensive Neurosurgery Board Preparation Book and 50.0% from Neurosurgery Board Review. Gemini exhibited a 17.8% "inability response" rate, explicitly stating it could not interpret images. The performance gap between the two models was significant (p < 0.0001), highlighting their limitations in handling complex visual data.
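The reported p-value can be checked with an exact test on the 2×2 table of correct/incorrect counts. The abstract does not state which test was used; the sketch below assumes a one-sided Fisher exact test, computed with exact rational arithmetic to avoid overflow on the large binomial coefficients:

```python
from fractions import Fraction
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    the probability, under the hypergeometric null with fixed margins,
    of the first row containing a or more successes."""
    n = a + b + c + d          # total responses from both models
    row1 = a + b               # questions answered by the first model
    col1 = a + c               # total correct answers across both models
    denom = comb(n, row1)
    p = Fraction(0)
    for k in range(a, min(row1, col1) + 1):
        p += Fraction(comb(col1, k) * comb(n - col1, row1 - k), denom)
    return float(p)

# ChatGPT-4.0: 84/250 correct; Gemini: 1/250 correct
p = fisher_one_sided(84, 166, 1, 249)
```

The resulting p-value is far below 0.0001, consistent with the reported significance.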

Conclusions

While ChatGPT-4.0 demonstrated some capacity to interpret image-based neurosurgery board questions, both models exhibited significant limitations, particularly in processing and analyzing complex visual data. These findings emphasize the need for targeted advancements in AI to improve visual interpretation in neurosurgical education and practice.
Source journal: Journal of Clinical Neuroscience (Medicine – Clinical Neurology)
CiteScore: 4.50
Self-citation rate: 0.00%
Articles per year: 402
Review time: 40 days
Journal description: This international journal, Journal of Clinical Neuroscience, publishes articles on clinical neurosurgery and neurology and the related neurosciences such as neuro-pathology, neuro-radiology, neuro-ophthalmology and neuro-physiology. The journal has a broad international perspective, and emphasises the advances occurring in Asia, the Pacific Rim region, Europe and North America. The journal acts as a focus for publication of major clinical and laboratory research, as well as publishing solicited manuscripts on specific subjects from experts, case reports and other information of interest to clinicians working in the clinical neurosciences.