Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.

Clinical Anatomy · IF 2.3 · CAS Zone 4 (Medicine) · JCR Q1 (Anatomy & Morphology) · Pub Date: 2024-11-21 · DOI: 10.1002/ca.24244
Volodymyr Mavrych, Paul Ganguly, Olena Bolgova
{"title":"Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.","authors":"Volodymyr Mavrych, Paul Ganguly, Olena Bolgova","doi":"10.1002/ca.24244","DOIUrl":null,"url":null,"abstract":"<p><p>The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. Selected chatbots were tested, answering 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts to answer each set of questions by the chatbots were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of questions accurately, then Copilot (42.0% ± 0.0%) and ChatGPT-3.5 (41.0% ± 5.3%), followed by ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-Turbo, PaLM2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of the three randomly selected topics-anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa-and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then for Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, Copilot, followed by Google PaLM 2; Copilot had the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.</p>","PeriodicalId":50687,"journal":{"name":"Clinical Anatomy","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Anatomy","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/ca.24244","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ANATOMY & MORPHOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiency and accuracy of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs) and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. The selected chatbots were tested on 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts by each chatbot to answer the question set were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of the questions accurately, followed by Copilot (42.0% ± 0.0%), ChatGPT-3.5 (41.0% ± 5.3%), and ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to that of Copilot, GPT-3.5, GPT-3.5-turbo, PaLM 2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of three randomly selected topics (the anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa), together with three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded the 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, followed by Google PaLM 2; Copilot had the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
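To make the reported statistics concrete, the sketch below shows how accuracy figures of the form "60.5% ± 1.9%" can be derived from five successive attempts at a 50-question MCQ set, and how two chatbots' attempt-level accuracies might be compared for significance. The per-attempt correct counts and the use of an independent two-sample t-test are illustrative assumptions; the abstract does not report per-attempt raw scores or name the statistical test used.

```python
# Minimal sketch (illustrative only): summarizing chatbot accuracy over five
# successive attempts at a 50-question MCQ set, in the "mean % ± SD %" style
# reported in the abstract. The per-attempt correct counts are hypothetical.
from statistics import mean, stdev
from scipy.stats import ttest_ind  # assumption: a two-sample t-test; the abstract does not name the test

N_QUESTIONS = 50

# Hypothetical numbers of correct answers per attempt for two of the chatbots.
correct_per_attempt = {
    "ChatGPT-4": [31, 30, 30, 31, 29],
    "Bard": [17, 16, 17, 17, 17],
}

accuracies = {}
for model, correct in correct_per_attempt.items():
    acc = [100 * c / N_QUESTIONS for c in correct]  # per-attempt accuracy, %
    accuracies[model] = acc
    print(f"{model}: {mean(acc):.1f}% +/- {stdev(acc):.1f}%")

# Pairwise comparison of attempt-level accuracies (p < 0.05 taken as significant).
t_stat, p_value = ttest_ind(accuracies["ChatGPT-4"], accuracies["Bard"])
print(f"ChatGPT-4 vs Bard: t = {t_stat:.2f}, p = {p_value:.4f}")
```

A paired or repeated-measures design could equally be argued for, since every chatbot answered the same question set; the unpaired test here is only one plausible reading of the abstract.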

Source journal
Clinical Anatomy (Medicine: Anatomy & Morphology)
CiteScore: 5.50
Self-citation rate: 12.50%
Annual articles: 154
Review time: 3 months
About the journal: Clinical Anatomy is the Official Journal of the American Association of Clinical Anatomists and the British Association of Clinical Anatomists. The goal of Clinical Anatomy is to provide a medium for the exchange of current information between anatomists and clinicians. This journal embraces anatomy in all its aspects as applied to medical practice. Furthermore, the journal assists physicians and other health care providers in keeping abreast of new methodologies for patient management and informs educators of new developments in clinical anatomy and teaching techniques. Clinical Anatomy publishes original and review articles of scientific, clinical, and educational interest. Papers covering the application of anatomic principles to the solution of clinical problems and/or the application of clinical observations to expand anatomic knowledge are welcomed.
Latest articles from this journal
"Practical Anatomy is to medical men what mathematics are to the physicist".
Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis.
Is dissection or prosection equal in dental anatomy education?
Comparative assessment of three AI platforms in answering USMLE Step 1 anatomy questions or identifying anatomical structures on radiographs.
Treatment of thoracic outlet syndrome to relieve chronic migraine.