Aim: Large language models (LLMs) are increasingly used in medicine, but their impact on neurosurgical education remains understudied. To our knowledge, this study is the first to evaluate Deepseek-R1, Gemini-2.0 Pro, ChatGPT-o3-mini-high, and GPT-4.5 on a mock neurosurgery board examination to assess their accuracy and educational value.
Material and methods: We created a 50-question mock neurosurgery board examination and administered it to three major LLMs (Deepseek-R1, Gemini-2.0 Pro, and ChatGPT-o3-mini-high) and 10 Turkish senior residents. Responses were systematically evaluated for accuracy, reasoning time, word count, and readability, and residents ranked the educational value of the LLM responses. Two recent ChatGPT versions, o3-mini-high and GPT-4.5, were additionally compared on the same test. Results were analyzed with statistical comparisons.
Results: In overall accuracy, all three LLMs scored higher than the residents: Deepseek-R1 84%, ChatGPT-o3-mini-high 82%, and Gemini-2.0 Pro 78%, versus 58% for the residents (p < 0.001). Deepseek-R1 required the longest reasoning time but provided the most organized responses, while Gemini-2.0 Pro produced the most detailed and readable answers. Residents preferred the explanations from Deepseek-R1 and Gemini-2.0 Pro over those from ChatGPT-o3-mini-high (p < 0.001). GPT-4.5 achieved 74% accuracy, higher than the residents but lower than the other LLMs. Compared with ChatGPT-o3-mini-high, GPT-4.5 produced longer, more complex responses while responding faster (p < 0.001).
Conclusion: The LLMs' higher scores on the mock board examination highlight their potential as auxiliary educational tools in neurosurgical training. The high accuracy of Deepseek-R1 and the clarity of Gemini-2.0 Pro's detailed responses suggest that, with refinement, these models could serve as neurosurgical educational guides or help construct board questions and training assessments.