Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment.

IF 3.8 | Zone 2 (Medicine) | Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING | Academic Radiology | Pub Date: 2024-09-17 | DOI: 10.1016/j.acra.2024.09.005

Muhammed Said Beşler, Laura Oleaga, Vanesa Junquero, Cristina Merino



Abstract

RATIONALE AND OBJECTIVES
This study aims to evaluate the performance of generative pre-trained transformer (GPT)-4o in the complete official European Board of Radiology (EBR) exam, designed to assess radiology knowledge, skills, and competence.

MATERIALS AND METHODS
Questions based on text, image, or video and in the format of multiple choice, free-text reporting, or image annotation were uploaded into GPT-4o using standardized prompting. The results were compared to the average scores of radiologists taking the exam in real time.

RESULTS
In Part 1 (multiple response questions and short cases), GPT-4o outperformed both the radiologists' average scores and the maximum pass score (70.2% vs. 58.4% and 60%, respectively). In Part 2 (clinically oriented reasoning evaluation), the performance of GPT-4o was below both the radiologists' average scores and the minimum pass score (52.9% vs. 66.1% and 55%, respectively). The accuracy on questions involving ultrasound images was higher compared to other imaging modalities (accuracy rate, 87.5-100%). For video-based questions, the performance was 50.6%. The model achieved the highest accuracy on most likely diagnosis questions but showed lower accuracy in free-text reporting and direct anatomical assessment in images (100% vs. 31% and 28.6%, respectively).

CONCLUSION
The abilities of GPT-4o in the official EBR exam are particularly noteworthy. This study demonstrates the potential of large language models to assist radiologists in assessing and managing cases from diagnosis to treatment or follow-up recommendations, even with zero-shot prompting.
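The score comparisons reported in the results can be laid out as a small calculation. The following is only an illustrative sketch, not part of the study: the class name `PartResult`, its field names, and the rule that a score at or above the quoted pass score counts as a pass are all assumptions introduced here. The six percentages are the ones quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartResult:
    """Scores for one exam part, in percent, as quoted in the abstract."""
    name: str
    gpt4o: float          # GPT-4o score
    radiologists: float   # radiologists' average score
    pass_score: float     # pass score quoted for this part

    def margin_vs_radiologists(self) -> float:
        # Positive means GPT-4o scored above the radiologists' average.
        return round(self.gpt4o - self.radiologists, 1)

    def margin_vs_pass(self) -> float:
        # Positive means GPT-4o scored above the quoted pass score.
        return round(self.gpt4o - self.pass_score, 1)

    def passed(self) -> bool:
        # Assumption: "at or above the pass score" counts as a pass.
        return self.gpt4o >= self.pass_score


# Values taken from the abstract; nothing else is assumed about the exam.
RESULTS = [
    PartResult("Part 1", gpt4o=70.2, radiologists=58.4, pass_score=60.0),
    PartResult("Part 2", gpt4o=52.9, radiologists=66.1, pass_score=55.0),
]

if __name__ == "__main__":
    for part in RESULTS:
        print(
            part.name,
            "vs radiologists:", part.margin_vs_radiologists(),
            "vs pass score:", part.margin_vs_pass(),
            "passed:", part.passed(),
        )
```

Under these assumptions, GPT-4o is 11.8 points above the radiologists' average and 10.2 points above the pass score in Part 1, and 13.2 points below the radiologists' average and 2.1 points below the pass score in Part 2.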




Source journal: Academic Radiology (Medicine: Nuclear Medicine)

CiteScore: 7.60

Self-citation rate: 10.40%

Articles published: 432

Review time: 18 days

Journal description: Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.