Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis

IF 1.9 4区 医学 Q3 CLINICAL NEUROLOGY Journal of Clinical Neuroscience Pub Date : 2025-02-11 DOI:10.1016/j.jocn.2025.111097
Alana M. McNulty , Harshitha Valluri , Avi A. Gajjar, Amanda Custozzo, Nicholas C. Field, Alexandra R. Paul
{"title":"Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis","authors":"Alana M. McNulty ,&nbsp;Harshitha Valluri ,&nbsp;Avi A. Gajjar,&nbsp;Amanda Custozzo,&nbsp;Nicholas C. Field,&nbsp;Alexandra R. Paul","doi":"10.1016/j.jocn.2025.111097","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>Artificial intelligence (AI) has gained significant attention in medicine, particularly in neurosurgery, where its potential is frequently discussed and occasionally feared. Large language models (LLMs), such as ChatGPT-4.0 (OpenAI) and Gemini (Google DeepMind), have shown promise in text-based tasks but remain underexplored in image-based domains, which are essential for neurosurgery. This study evaluates the performance of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions, focusing on their ability to interpret visual data, a critical aspect of neurosurgical decision-making.</div></div><div><h3>Methods</h3><div>A total of 250 image-based questions selected from two neurosurgical review textbooks were obtained. Each question was presented to both ChatGPT-4.0 and Gemini in its original format, including images such as MRI scans, pathology slides, and surgical visuals. The models were tasked with answering the questions, and their accuracy was determined based on the number of correct responses.</div></div><div><h3>Results</h3><div>ChatGPT-4.0 correctly answered 84 questions (33.6 %), significantly outperforming Gemini, which answered only 1 question correctly (0.4 %) (p &lt; 0.0001). ChatGPT-4.0 provided correct answers for 17.7 % of questions from The Comprehensive Neurosurgery Board Preparation Book and 50.0 % from Neurosurgery Board Review. Gemini exhibited a 17.8 % “inability response” rate, explicitly stating it could not interpret images. The performance gap between the two models was significant (p &lt; 0.0001), highlighting their limitations in handling complex visual data.</div></div><div><h3>Conclusions</h3><div>While ChatGPT-4.0 demonstrated some capacity to interpret image-based neurosurgery board questions, both models exhibited significant limitations, particularly in processing and analyzing complex visual data. These findings emphasize the need for targeted advancements in AI to improve visual interpretation in neurosurgical education and practice.</div></div>","PeriodicalId":15487,"journal":{"name":"Journal of Clinical Neuroscience","volume":"134 ","pages":"Article 111097"},"PeriodicalIF":1.9000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Clinical Neuroscience","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0967586825000694","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction

Artificial intelligence (AI) has gained significant attention in medicine, particularly in neurosurgery, where its potential is frequently discussed and occasionally feared. Large language models (LLMs), such as ChatGPT-4.0 (OpenAI) and Gemini (Google DeepMind), have shown promise in text-based tasks but remain underexplored in image-based domains, which are essential for neurosurgery. This study evaluates the performance of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions, focusing on their ability to interpret visual data, a critical aspect of neurosurgical decision-making.

Methods

A total of 250 image-based questions selected from two neurosurgical review textbooks were obtained. Each question was presented to both ChatGPT-4.0 and Gemini in its original format, including images such as MRI scans, pathology slides, and surgical visuals. The models were tasked with answering the questions, and their accuracy was determined based on the number of correct responses.

Results

ChatGPT-4.0 correctly answered 84 questions (33.6 %), significantly outperforming Gemini, which answered only 1 question correctly (0.4 %) (p < 0.0001). ChatGPT-4.0 provided correct answers for 17.7 % of questions from The Comprehensive Neurosurgery Board Preparation Book and 50.0 % from Neurosurgery Board Review. Gemini exhibited a 17.8 % “inability response” rate, explicitly stating it could not interpret images. The performance gap between the two models was significant (p < 0.0001), highlighting their limitations in handling complex visual data.

Conclusions

While ChatGPT-4.0 demonstrated some capacity to interpret image-based neurosurgery board questions, both models exhibited significant limitations, particularly in processing and analyzing complex visual data. These findings emphasize the need for targeted advancements in AI to improve visual interpretation in neurosurgical education and practice.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Clinical Neuroscience
Journal of Clinical Neuroscience 医学-临床神经学
CiteScore
4.50
自引率
0.00%
发文量
402
审稿时长
40 days
期刊介绍: This International journal, Journal of Clinical Neuroscience, publishes articles on clinical neurosurgery and neurology and the related neurosciences such as neuro-pathology, neuro-radiology, neuro-ophthalmology and neuro-physiology. The journal has a broad International perspective, and emphasises the advances occurring in Asia, the Pacific Rim region, Europe and North America. The Journal acts as a focus for publication of major clinical and laboratory research, as well as publishing solicited manuscripts on specific subjects from experts, case reports and other information of interest to clinicians working in the clinical neurosciences.
期刊最新文献
Current and future clinical trials for the use of neuromodulation in the treatment of stroke: A review of the clinical Trials.gov database Risk factors for residual dizziness after successful repositioning in elderly patients with benign paroxysmal positional vertigo Clinical and imaging features and treatment response of anti-NMDAR encephalitis combined with MOGAD Validation of the Korean version of the Bad Sobernheim stress Questionnaire-Brace in adolescent idiopathic scoliosis Pure endoscopic presigmoid infralabyrinthine approach for jugular foramen tumors: Operative technique and early results
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1