Muhammed Said Beşler, Laura Oleaga, Vanesa Junquero, Cristina Merino
Evaluating GPT-4o's Performance in the Official European Board of Radiology Exam: A Comprehensive Assessment
Academic Radiology (journal article). Published 2024-09-17. DOI: 10.1016/j.acra.2024.09.005 (https://doi.org/10.1016/j.acra.2024.09.005)
Abstract
RATIONALE AND OBJECTIVES
This study aims to evaluate the performance of generative pre-trained transformer (GPT)-4o in the complete official European Board of Radiology (EBR) exam, designed to assess radiology knowledge, skills, and competence.
MATERIALS AND METHODS
Questions based on text, image, or video and in the format of multiple choice, free-text reporting, or image annotation were uploaded into GPT-4o using standardized prompting. The results were compared to the average scores of radiologists taking the exam in real time.
RESULTS
In Part 1 (multiple response questions and short cases), GPT-4o outperformed both the radiologists' average scores and the maximum pass score (70.2% vs. 58.4% and 60%, respectively). In Part 2 (clinically oriented reasoning evaluation), the performance of GPT-4o was below both the radiologists' average scores and the minimum pass score (52.9% vs. 66.1% and 55%, respectively). Accuracy on questions involving ultrasound images was higher than on other imaging modalities (accuracy rate, 87.5–100%). For video-based questions, the performance was 50.6%. The model achieved the highest accuracy on most likely diagnosis questions but showed lower accuracy in free-text reporting and direct anatomical assessment in images (100% vs. 31% and 28.6%, respectively).
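The Part 1 and Part 2 comparisons above come down to simple differences in percentage points. The sketch below recomputes those margins from the figures reported in the abstract. The dictionary layout, the `margins` helper, and the treatment of each reported pass score as a single threshold are illustrative assumptions, not part of the study's method.

```python
# Scores reported in the abstract, in percent.
# The structure and key names here are illustrative assumptions.
results = {
    "Part 1": {"gpt4o": 70.2, "radiologists": 58.4, "pass_score": 60.0},
    "Part 2": {"gpt4o": 52.9, "radiologists": 66.1, "pass_score": 55.0},
}


def margins(scores):
    """Return GPT-4o's margins in percentage points (hypothetical helper)."""
    return {
        # Positive values mean GPT-4o scored higher than the reference.
        "vs_radiologists": round(scores["gpt4o"] - scores["radiologists"], 1),
        "vs_pass_score": round(scores["gpt4o"] - scores["pass_score"], 1),
        # Assumption: reaching the reported pass score counts as passing.
        "passed": scores["gpt4o"] >= scores["pass_score"],
    }


summary = {part: margins(scores) for part, scores in results.items()}

for part, m in summary.items():
    print(part, m)
```

Under these assumptions, GPT-4o is 11.8 points above the radiologists' average and 10.2 points above the pass score in Part 1, and 13.2 points below the radiologists' average and 2.1 points below the pass score in Part 2.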
CONCLUSION
The abilities of GPT-4o in the official EBR exam are particularly noteworthy. This study demonstrates the potential of large language models to assist radiologists in assessing and managing cases from diagnosis to treatment or follow-up recommendations, even with zero-shot prompting.
About the journal:
Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.