Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
Valerie Builoff, Aakash Shanbhag, Robert JH Miller, Damini Dey, Joanna X Liang, Kathleen Flood, Jamieson M Bourque, Panithaya Chareonthaitawee, Lawrence M Phillips, Piotr J Slomka
Journal of Nuclear Cardiology, 102089 (2024). DOI: 10.1016/j.nuclcard.2024.102089
Abstract
Background: Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, and GPT-4 omni (GPT-4o) from OpenAI, and Gemini from Google, in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM received the same standardized prompt and was applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared proportions of correct responses between models.
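To illustrate the paired comparison mentioned above, the sketch below shows how per-question correctness for two models could be compared with McNemar's test. This is a minimal sketch, not the authors' code: the correctness vectors are randomly generated placeholders, and the use of statsmodels and an exact test are assumptions for illustration only.

```python
# Minimal sketch: comparing two models' paired correct/incorrect answers with McNemar's test.
# The correctness vectors below are hypothetical; the study's actual data are not reproduced here.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 168  # total questions in the exam set
model_a_correct = rng.integers(0, 2, size=n_questions)  # 1 = correct, 0 = incorrect
model_b_correct = rng.integers(0, 2, size=n_questions)

# Build the 2x2 paired contingency table:
# rows = model A (correct, incorrect), columns = model B (correct, incorrect)
table = np.zeros((2, 2), dtype=int)
for a, b in zip(model_a_correct, model_b_correct):
    table[1 - a, 1 - b] += 1

# McNemar's test evaluates asymmetry in the discordant cells
# (questions one model answered correctly and the other did not).
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic}, p-value = {result.pvalue:.4f}")
```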
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval [CI] 55.4% to 58.0%), 40.5% (39.9% to 42.9%), 60.7% (59.5% to 61.3%), and 63.1% (62.5% to 64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo, P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001), while Gemini performed worse on image-based questions (P < .001 for all).
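To make the summary statistics concrete, the following sketch shows one plausible way to derive a median accuracy with a 95% interval from 30 repeated runs, using a bootstrap of the median. The run-level accuracies are simulated placeholders, and the bootstrap approach is an assumption for illustration; the paper does not state how its confidence intervals were computed.

```python
# Minimal sketch: summarizing accuracy across 30 repeated runs as a median with a
# bootstrap percentile 95% CI. Run accuracies below are simulated, not study data.
import numpy as np

rng = np.random.default_rng(1)
run_accuracies = rng.normal(loc=63.0, scale=1.5, size=30)  # hypothetical % correct per run

median_acc = np.median(run_accuracies)

# Resample runs with replacement and recompute the median many times.
boot_medians = np.array([
    np.median(rng.choice(run_accuracies, size=run_accuracies.size, replace=True))
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print(f"median = {median_acc:.1f}% (95% CI {ci_low:.1f}% to {ci_high:.1f}%)")
```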
Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
About the Journal:
Journal of Nuclear Cardiology is the only journal in the world devoted to this dynamic and growing subspecialty. Physicians and technologists value the Journal not only for its peer-reviewed articles, but also for its timely discussions about the current and future role of nuclear cardiology. Original articles address all aspects of nuclear cardiology, including interpretation, diagnosis, imaging equipment, and use of radiopharmaceuticals. As the official publication of the American Society of Nuclear Cardiology, the Journal also brings readers the latest information emerging from the Society's task forces and publishes guidelines and position papers as they are adopted.