Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment.

Journal: Academic Radiology | IF 3.8 | Zone 2, Medicine | Q1, Radiology, Nuclear Medicine & Medical Imaging | Pub Date: 2024-09-17 | DOI: 10.1016/j.acra.2024.09.005
Muhammed Said Beşler, Laura Oleaga, Vanesa Junquero, Cristina Merino
Citations: 0

Abstract

RATIONALE AND OBJECTIVES
This study aims to evaluate the performance of generative pre-trained transformer (GPT)-4o in the complete official European Board of Radiology (EBR) exam, designed to assess radiology knowledge, skills, and competence.

MATERIALS AND METHODS
Questions based on text, image, or video and in the format of multiple choice, free-text reporting, or image annotation were uploaded into GPT-4o using standardized prompting. The results were compared to the average scores of radiologists taking the exam in real time.

RESULTS
In Part 1 (multiple response questions and short cases), GPT-4o outperformed both the radiologists' average scores and the maximum pass score (70.2% vs. 58.4% and 60%, respectively). In Part 2 (clinically oriented reasoning evaluation), the performance of GPT-4o was below both the radiologists' average scores and the minimum pass score (52.9% vs. 66.1% and 55%, respectively). The accuracy on questions involving ultrasound images was higher compared to other imaging modalities (accuracy rate, 87.5-100%). For video-based questions, the performance was 50.6%. The model achieved the highest accuracy on most likely diagnosis questions but showed lower accuracy in free-text reporting and direct anatomical assessment in images (100% vs. 31% and 28.6%, respectively).

CONCLUSION
The abilities of GPT-4o in the official EBR exam are particularly noteworthy. This study demonstrates the potential of large language models to assist radiologists in assessing and managing cases from diagnosis to treatment or follow-up recommendations, even with zero-shot prompting.
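
The Part 1 and Part 2 comparisons reported in the abstract can be restated as simple margins in percentage points. The sketch below is only an illustration: the figures are copied from the abstract, while the dictionary layout, the key names, and the `margins` helper are assumptions made here and do not come from the study. It also assumes that the "maximum pass score" and "minimum pass score" in the abstract both denote the pass threshold for the respective part.

```python
# Figures (percent) copied from the RESULTS paragraph of the abstract.
# Key names and structure are illustrative assumptions, not from the paper.
REPORTED = {
    "Part 1": {"gpt4o": 70.2, "radiologists_mean": 58.4, "pass_mark": 60.0},
    "Part 2": {"gpt4o": 52.9, "radiologists_mean": 66.1, "pass_mark": 55.0},
}


def margins(scores):
    """Return GPT-4o's margin, in percentage points, over each reference value."""
    return {
        "vs_radiologists": round(scores["gpt4o"] - scores["radiologists_mean"], 1),
        "vs_pass_mark": round(scores["gpt4o"] - scores["pass_mark"], 1),
        # Assumption: meeting or exceeding the threshold counts as a pass.
        "passed": scores["gpt4o"] >= scores["pass_mark"],
    }


SUMMARY = {part: margins(scores) for part, scores in REPORTED.items()}

if __name__ == "__main__":
    for part, result in SUMMARY.items():
        print(part, result)
```

Under these assumptions, GPT-4o sits 11.8 points above the radiologists' average in Part 1 and 13.2 points below it in Part 2, which matches the direction of the findings stated in the abstract.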
Source journal
Academic Radiology (Medicine: Nuclear Medicine)
CiteScore: 7.60
Self-citation rate: 10.40%
Articles published: 432
Review time: 18 days
Journal description: Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.
Latest articles from this journal
Clinical Impact of Radiologist's Alert System on Patient Care for High-risk Incidental CT Findings: A Machine Learning-Based Risk Factor Analysis.
Magnetic Resonance Imaging-Based Radiomics of Axial and Sagittal Orientation in Pregnant Patients with Suspected Placenta Accreta Spectrum.
Navigating a Radiology Conference: A Comprehensive Guide for Learners.
Radiomics Combined with ACR TI-RADS for Thyroid Nodules: Diagnostic Performance, Unnecessary Biopsy Rate, and Nomogram Construction.
The association between FLAIR vascular hyperintensities and outcomes in patients with border zone infarcts treated with medical therapy may vary with the infarct subtype.