Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis.

JMIR Medical Education | IF 3.2, Q1 (Education, Scientific Disciplines) | Publication date: 2025-01-16 | DOI: 10.2196/64284
Boxiong Wei
{"title":"Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis.","authors":"Boxiong Wei","doi":"10.2196/64284","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy.</p><p><strong>Objective: </strong>This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.</p><p><strong>Methods: </strong>A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ2 tests and ANOVA.</p><p><strong>Results: </strong>GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, and 0.25 (95% CI 0.14-0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=0.02) and had an odds ratio of 0.48 (95% CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions.</p><p><strong>Conclusions: </strong>GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology.</p>","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"11 ","pages":"e64284"},"PeriodicalIF":3.2000,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11756834/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/64284","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Advances in artificial intelligence have enabled large language models to influence radiology education and diagnostic accuracy substantially.

Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams.

Methods: A comparative analysis of 150 text-only (image-free) multiple-choice questions from radiology board exams was conducted. Each model's accuracy on these questions was assessed, with questions categorized by cognitive level and medical specialty; performance differences were analyzed using χ2 tests and ANOVA.
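
To make the analysis concrete, here is a minimal sketch of the pairwise accuracy comparison, assuming a χ2 test of independence on a 2x2 correct/incorrect contingency table. The authors' actual analysis code and software are not published, so the function name and the choice of continuity correction are illustrative only; the counts come from the Results section below.

```python
# Minimal sketch of the pairwise accuracy comparison (chi-square test on a
# 2x2 correct/incorrect contingency table). The study's actual code is not
# published; correction=False (no Yates' correction) is an assumption.
from scipy.stats import chi2_contingency

N_QUESTIONS = 150  # text-only multiple-choice questions

def compare_accuracy(correct_a: int, correct_b: int, n: int = N_QUESTIONS) -> float:
    """P value for the difference in accuracy between two models."""
    table = [
        [correct_a, n - correct_a],  # model A: correct, incorrect
        [correct_b, n - correct_b],  # model B: correct, incorrect
    ]
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p

# GPT-4 (125/150) vs Tongyi Qianwen (106/150)
print(round(compare_accuracy(125, 106), 3))
```

Run this way, the sketch reproduces the P=.009 reported below for GPT-4 versus Tongyi Qianwen, which suggests an uncorrected χ2 test, though that remains a guess.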

Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, 0.25 (95% CI 0.14-0.45) for Gemini Pro, and 0.48 (95% CI 0.27-0.87) for Tongyi Qianwen, which performed comparatively well. Performance varied across question types and specialties: GPT-4 excelled at both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions.
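
The reported odds ratios can be reproduced from the raw counts. The sketch below assumes a plain 2x2 odds ratio with a Wald 95% CI on the log scale; the paper does not state its exact interval method, so the confidence limits come out close to, but not identical with, the published ones.

```python
# Worked example reproducing the odds ratios in the Results from raw counts,
# assuming a simple 2x2 odds ratio with a Wald 95% CI on the log scale. The
# paper's exact interval method is not stated, so the limits are approximate.
import math

N_QUESTIONS = 150
GPT4_CORRECT = 125  # reference model

def odds_ratio_vs_gpt4(correct: int, n: int = N_QUESTIONS):
    a, b = correct, n - correct            # comparison model: correct, incorrect
    c, d = GPT4_CORRECT, n - GPT4_CORRECT  # GPT-4 (reference): correct, incorrect
    or_point = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_point) - 1.96 * se_log_or)
    hi = math.exp(math.log(or_point) + 1.96 * se_log_or)
    return or_point, lo, hi

for model, correct in [("Claude", 93), ("Bard", 82),
                       ("Gemini Pro", 83), ("Tongyi Qianwen", 106)]:
    or_point, lo, hi = odds_ratio_vs_gpt4(correct)
    print(f"{model}: OR {or_point:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

The point estimates (0.33, 0.24, 0.25, 0.48) match the published values exactly, consistent with simple 2x2 odds ratios computed against GPT-4; only the interval calculation is assumed here.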

Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology.

Source journal: JMIR Medical Education (Social Sciences - Education)
CiteScore: 6.90
Self-citation rate: 5.60%
Articles published: 54
Review time: 8 weeks