Performance evaluation of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions: A comparative analysis
Alana M. McNulty, Harshitha Valluri, Avi A. Gajjar, Amanda Custozzo, Nicholas C. Field, Alexandra R. Paul
Journal of Clinical Neuroscience, Volume 134, Article 111097. DOI: 10.1016/j.jocn.2025.111097. Published 2025-02-11.
Citations: 0
Abstract
Introduction
Artificial intelligence (AI) has gained significant attention in medicine, particularly in neurosurgery, where its potential is frequently discussed and occasionally feared. Large language models (LLMs), such as ChatGPT-4.0 (OpenAI) and Gemini (Google DeepMind), have shown promise in text-based tasks but remain underexplored in image-based domains, which are essential for neurosurgery. This study evaluates the performance of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions, focusing on their ability to interpret visual data, a critical aspect of neurosurgical decision-making.
Methods
A total of 250 image-based questions were selected from two neurosurgical review textbooks. Each question was presented to both ChatGPT-4.0 and Gemini in its original format, including images such as MRI scans, pathology slides, and surgical visuals. The models were tasked with answering the questions, and their accuracy was determined based on the number of correct responses.
Results
ChatGPT-4.0 correctly answered 84 questions (33.6 %), significantly outperforming Gemini, which answered only 1 question correctly (0.4 %) (p < 0.0001). ChatGPT-4.0 provided correct answers for 17.7 % of questions from The Comprehensive Neurosurgery Board Preparation Book and 50.0 % from Neurosurgery Board Review. Gemini exhibited a 17.8 % “inability response” rate, explicitly stating it could not interpret images. The performance gap between the two models was significant (p < 0.0001), highlighting their limitations in handling complex visual data.
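The abstract reports the raw counts (84/250 correct for ChatGPT-4.0 vs 1/250 for Gemini) and a p-value below 0.0001, but does not name the statistical test used. Assuming a two-sided Fisher's exact test on the 2×2 table of correct/incorrect responses (an assumption, not the authors' stated method), a minimal self-contained sketch reproduces a p-value consistent with the reported threshold:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Hypergeometric probability of observing x in the top-left cell,
        # with all row and column margins held fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))   # smallest feasible top-left count
    hi = min(row1, col1)             # largest feasible top-left count
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p_table(x) for x in range(lo, hi + 1) if p_table(x) <= p_obs + 1e-12)

# Counts reported in the abstract: 84/250 correct (ChatGPT-4.0), 1/250 (Gemini)
p = fisher_exact_two_sided(84, 250 - 84, 1, 250 - 1)
print(f"ChatGPT-4.0 accuracy: {84/250:.1%}, Gemini accuracy: {1/250:.1%}")
print(f"two-sided Fisher's exact p-value: {p:.2e}")
```

The accuracies work out to 33.6% and 0.4%, matching the abstract, and the p-value falls far below 0.0001, consistent with the reported significance.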
Conclusions
While ChatGPT-4.0 demonstrated some capacity to interpret image-based neurosurgery board questions, both models exhibited significant limitations, particularly in processing and analyzing complex visual data. These findings emphasize the need for targeted advancements in AI to improve visual interpretation in neurosurgical education and practice.
Journal Introduction:
This international journal, Journal of Clinical Neuroscience, publishes articles on clinical neurosurgery and neurology and the related neurosciences, such as neuro-pathology, neuro-radiology, neuro-ophthalmology and neuro-physiology.
The journal has a broad international perspective and emphasises the advances occurring in Asia, the Pacific Rim region, Europe and North America. The journal acts as a focus for the publication of major clinical and laboratory research, and also publishes solicited manuscripts on specific subjects from experts, case reports, and other information of interest to clinicians working in the clinical neurosciences.