Claude, ChatGPT, Copilot, and Gemini Performance versus Students in Different Topics of Neuroscience
Volodymyr Mavrych, Ahmed Yaqinuddin, Olena Bolgova
Advances in Physiology Education, published online 2025-01-17. DOI: 10.1152/advan.00093.2024
Abstract
Despite extensive studies on large language models and their ability to answer questions from various licensing exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, particularly medical neuroscience. This study compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5, GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) against students on MCQs from a medical neuroscience course database to evaluate the chatbots' reliability. Five successive attempts by each chatbot to answer 200 USMLE-style questions were evaluated for accuracy, relevance, and comprehensiveness. The MCQs were grouped into 12 categories/topics. The results indicated that, at their current level of development, the selected AI-driven chatbots could, on average, accurately answer 67.2% of MCQs from the medical neuroscience course, which is 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots with 83% and 81.7% correct answers, respectively, which is better than the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three best-performing topics, with average results of 78.1%-86.7%, while the lowest results were for Brainstem, Special Senses, and Cerebellum, with 54.4%-57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most advanced chatbots, exhibiting proficiency in answering neuroscience MCQs that surpasses that of the average medical student. This represents a significant milestone in how AI can supplement and enhance educational tools and techniques.
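The scoring described here (five repeated attempts per chatbot on 200 MCQs, summarized overall and within 12 topics) can be reproduced with a short script. The sketch below is not from the paper; the file name, column names, and tabular layout are assumptions for illustration only.

```python
# Minimal sketch of per-chatbot and per-topic accuracy from repeated MCQ attempts.
# Assumed columns: chatbot, topic, question_id, attempt (1-5), is_correct (0/1).
import pandas as pd

responses = pd.read_csv("chatbot_mcq_attempts.csv")  # hypothetical input file

# Overall accuracy per chatbot, averaged over all attempts and questions
overall = (responses.groupby("chatbot")["is_correct"].mean() * 100).round(1)

# Accuracy per chatbot within each of the 12 topics
by_topic = (
    responses.groupby(["chatbot", "topic"])["is_correct"]
    .mean()
    .mul(100)
    .round(1)
    .unstack("topic")
)

print(overall.sort_values(ascending=False))
print(by_topic)
```

With data in this assumed format, the first table would correspond to the overall percentages reported in the abstract (e.g., 83% for Claude) and the second to the per-topic breakdown (e.g., Neurocytology versus Brainstem).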
Journal description:
Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.