Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
Valerie Builoff, Aakash Shanbhag, Robert JH Miller, Damini Dey, Joanna X Liang, Kathleen Flood, Jamieson M Bourque, Panithaya Chareonthaitawee, Lawrence M Phillips, Piotr J Slomka
Journal of Nuclear Cardiology, 102089 (2024). DOI: 10.1016/j.nuclcard.2024.102089
Abstract
Background: Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, and GPT-4 omni (GPT-4o) from OpenAI, and Gemini from Google, in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM received the same standardized prompt and was applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared proportions of correct responses between models.
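To illustrate the paired comparison mentioned above, the sketch below shows how per-question correctness for two models could be compared with McNemar's test. This is a minimal sketch, not the authors' code: the correctness vectors are randomly generated placeholders, and the use of statsmodels and an exact test are assumptions for illustration only.

```python
# Minimal sketch: comparing two models' paired correct/incorrect answers with McNemar's test.
# The correctness vectors below are hypothetical; the study's actual data are not reproduced here.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 168  # total questions in the exam set
model_a_correct = rng.integers(0, 2, size=n_questions)  # 1 = correct, 0 = incorrect
model_b_correct = rng.integers(0, 2, size=n_questions)

# Build the 2x2 paired contingency table:
# rows = model A (correct, incorrect), columns = model B (correct, incorrect)
table = np.zeros((2, 2), dtype=int)
for a, b in zip(model_a_correct, model_b_correct):
    table[1 - a, 1 - b] += 1

# McNemar's test evaluates asymmetry in the discordant cells
# (questions one model answered correctly and the other did not).
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic}, p-value = {result.pvalue:.4f}")
```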
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval [CI] 55.4% to 58.0%), 40.5% (39.9% to 42.9%), 60.7% (59.5% to 61.3%), and 63.1% (62.5% to 64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo, P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001), while Gemini performed worse on image-based questions (P < .001 for all).
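To make the summary statistics concrete, the following sketch shows one plausible way to derive a median accuracy with a 95% interval from 30 repeated runs, using a bootstrap of the median. The run-level accuracies are simulated placeholders, and the bootstrap approach is an assumption for illustration; the paper does not state how its confidence intervals were computed.

```python
# Minimal sketch: summarizing accuracy across 30 repeated runs as a median with a
# bootstrap percentile 95% CI. Run accuracies below are simulated, not study data.
import numpy as np

rng = np.random.default_rng(1)
run_accuracies = rng.normal(loc=63.0, scale=1.5, size=30)  # hypothetical % correct per run

median_acc = np.median(run_accuracies)

# Resample runs with replacement and recompute the median many times.
boot_medians = np.array([
    np.median(rng.choice(run_accuracies, size=run_accuracies.size, replace=True))
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print(f"median = {median_acc:.1f}% (95% CI {ci_low:.1f}% to {ci_high:.1f}%)")
```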
Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
About the Journal:
Journal of Nuclear Cardiology is the only journal in the world devoted to this dynamic and growing subspecialty. Physicians and technologists value the Journal not only for its peer-reviewed articles, but also for its timely discussions about the current and future role of nuclear cardiology. Original articles address all aspects of nuclear cardiology, including interpretation, diagnosis, imaging equipment, and use of radiopharmaceuticals. As the official publication of the American Society of Nuclear Cardiology, the Journal also brings readers the latest information emerging from the Society's task forces and publishes guidelines and position papers as they are adopted.