Introduction
Artificial intelligence (AI) has gained significant attention in medicine, particularly in neurosurgery, where its potential is frequently discussed and occasionally feared. Large language models (LLMs), such as ChatGPT-4.0 (OpenAI) and Gemini (Google DeepMind), have shown promise in text-based tasks but remain underexplored in image-based domains, which are essential for neurosurgery. This study evaluates the performance of ChatGPT-4.0 and Gemini on image-based neurosurgery board practice questions, focusing on their ability to interpret visual data, a critical aspect of neurosurgical decision-making.
Methods
A total of 250 image-based questions were selected from two neurosurgical review textbooks: The Comprehensive Neurosurgery Board Preparation Book and Neurosurgery Board Review. Each question was presented to both ChatGPT-4.0 and Gemini in its original format, including images such as MRI scans, pathology slides, and surgical images. Each model's accuracy was calculated as the proportion of questions answered correctly. A minimal scoring sketch is provided below.
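The following sketch illustrates how per-model accuracy and the "inability response" rate could be tallied from graded responses. The data layout and field names (e.g., question_id, model_answer, correct_answer, inability) are illustrative assumptions, not the authors' actual grading pipeline.

```python
# Illustrative scoring sketch (assumed data layout, not the published protocol).
# Each record pairs a model's recorded answer with the textbook answer key.

from dataclasses import dataclass

@dataclass
class GradedResponse:
    question_id: int
    model_answer: str        # answer choice returned by the model, e.g. "B"
    correct_answer: str      # answer key from the review textbook
    inability: bool = False  # True if the model stated it could not interpret the image

def accuracy(responses: list[GradedResponse]) -> float:
    """Proportion of questions answered correctly."""
    correct = sum(r.model_answer == r.correct_answer for r in responses)
    return correct / len(responses)

def inability_rate(responses: list[GradedResponse]) -> float:
    """Proportion of questions the model declined because it could not read the image."""
    return sum(r.inability for r in responses) / len(responses)
```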
Results
ChatGPT-4.0 correctly answered 84 questions (33.6%), significantly outperforming Gemini, which answered only 1 question correctly (0.4%; p < 0.0001). ChatGPT-4.0 provided correct answers for 17.7% of questions from The Comprehensive Neurosurgery Board Preparation Book and 50.0% from Neurosurgery Board Review. Gemini exhibited a 17.8% "inability response" rate, explicitly stating that it could not interpret the images. This marked performance gap, together with both models' overall accuracy, highlights their limitations in handling complex visual data.
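The abstract does not state which statistical test was used; as a sanity check, a two-sided Fisher's exact test on the reported counts (84/250 vs. 1/250 correct) also yields p far below 0.0001. The snippet below is a hedged sketch assuming this 2x2 contingency table.

```python
# Sanity-check sketch: compare 84/250 vs 1/250 correct answers.
# Fisher's exact test is shown as one standard choice for a 2x2 table
# with a sparse cell; the authors' actual test is not specified here.

from scipy.stats import fisher_exact

#                 correct  incorrect
table = [[84, 250 - 84],   # ChatGPT-4.0
         [ 1, 250 -  1]]   # Gemini

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2e}")  # p is far below 0.0001
```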
Conclusions
While ChatGPT-4.0 demonstrated some capacity to interpret image-based neurosurgery board questions, both models exhibited significant limitations, particularly in processing and analyzing complex visual data. These findings emphasize the need for targeted advancements in AI to improve visual interpretation in neurosurgical education and practice.