Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.

IF 3.0 | CAS Zone 4 (Medicine) | JCR Q2 (Cardiac & Cardiovascular Systems) | Journal of Nuclear Cardiology | Pub Date: 2024-11-29 | DOI: 10.1016/j.nuclcard.2024.102089
Valerie Builoff, Aakash Shanbhag, Robert Jh Miller, Damini Dey, Joanna X Liang, Kathleen Flood, Jamieson M Bourque, Panithaya Chareonthaitawee, Lawrence M Phillips, Piotr J Slomka
{"title":"评估人工智能在核心脏病学中的熟练程度:大型语言模型参加董事会准备考试。","authors":"Valerie Builoff, Aakash Shanbhag, Robert Jh Miller, Damini Dey, Joanna X Liang, Kathleen Flood, Jamieson M Bourque, Panithaya Chareonthaitawee, Lawrence M Phillips, Piotr J Slomka","doi":"10.1016/j.nuclcard.2024.102089","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs-GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (Open AI), and Gemini (Google Inc.)-in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.</p><p><strong>Methods: </strong>We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.</p><p><strong>Results: </strong>GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4% - 58.0%), 40.5% (39.9% - 42.9%), 60.7% (59.5% - 61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed other models (P = .007 vs GPT-4 Turbo, P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001), while Gemini performed worse on image-based questions (P < .001 for all).</p><p><strong>Conclusion: </strong>GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.</p>","PeriodicalId":16476,"journal":{"name":"Journal of Nuclear Cardiology","volume":" ","pages":"102089"},"PeriodicalIF":3.0000,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.\",\"authors\":\"Valerie Builoff, Aakash Shanbhag, Robert Jh Miller, Damini Dey, Joanna X Liang, Kathleen Flood, Jamieson M Bourque, Panithaya Chareonthaitawee, Lawrence M Phillips, Piotr J Slomka\",\"doi\":\"10.1016/j.nuclcard.2024.102089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs-GPT-4, GPT-4 Turbo, GPT-4omni (GPT-4o) (Open AI), and Gemini (Google Inc.)-in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.</p><p><strong>Methods: </strong>We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. 
Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.</p><p><strong>Results: </strong>GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4% - 58.0%), 40.5% (39.9% - 42.9%), 60.7% (59.5% - 61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed other models (P = .007 vs GPT-4 Turbo, P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001), while Gemini performed worse on image-based questions (P < .001 for all).</p><p><strong>Conclusion: </strong>GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.</p>\",\"PeriodicalId\":16476,\"journal\":{\"name\":\"Journal of Nuclear Cardiology\",\"volume\":\" \",\"pages\":\"102089\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-11-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Nuclear Cardiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.nuclcard.2024.102089\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CARDIAC & CARDIOVASCULAR SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Nuclear Cardiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.nuclcard.2024.102089","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract


Background: Previous studies evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, namely GPT-4, GPT-4 Turbo, and GPT-4 omni (GPT-4o) from OpenAI, and Gemini from Google Inc., in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.

Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.
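
The authors' analysis code is not included here; the following is a minimal sketch, assuming hypothetical 0/1 correctness vectors for two models on the same 168 questions, of how McNemar's test (via statsmodels) can compare paired correct-response proportions as described in the Methods:

# Minimal sketch (not the authors' code): McNemar's test on paired correctness.
# The 0/1 correctness arrays below are hypothetical placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 168                              # 141 text-only + 27 image-based
model_a = rng.integers(0, 2, n_questions)      # 1 = correct, 0 = incorrect (hypothetical)
model_b = rng.integers(0, 2, n_questions)

# 2x2 table of paired outcomes: rows = model A correct/incorrect, cols = model B
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)), np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)), np.sum((model_a == 0) & (model_b == 0))],
])
result = mcnemar(table, exact=True)            # exact binomial test on the discordant pairs
print(f"McNemar statistic={result.statistic}, p-value={result.pvalue:.3f}")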

Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.5%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse on image-based questions (P < .001 for all).
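
As an illustration only (the authors' exact interval method is not described in the abstract), this sketch summarizes hypothetical per-run scores the way the Results are reported: the median percent correct across 30 repetitions of one model, with a simple 2.5th-97.5th percentile interval over the run-level scores:

# Minimal sketch with hypothetical data: median percent correct across 30 runs.
import numpy as np

rng = np.random.default_rng(1)
n_runs, n_questions = 30, 168
# Hypothetical per-run correctness for one model (probability 0.60 of a correct answer)
runs = rng.random((n_runs, n_questions)) < 0.60
percent_correct = runs.mean(axis=1) * 100          # one score per run

median_score = np.median(percent_correct)
low, high = np.percentile(percent_correct, [2.5, 97.5])
print(f"median={median_score:.1f}%, 2.5th-97.5th percentile={low:.1f}%-{high:.1f}%")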

Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.

Source journal: Journal of Nuclear Cardiology | CiteScore: 5.30 | Self-citation rate: 20.80% | Articles published: 249 | Review time: 4-8 weeks
About the journal: Journal of Nuclear Cardiology is the only journal in the world devoted to this dynamic and growing subspecialty. Physicians and technologists value the Journal not only for its peer-reviewed articles, but also for its timely discussions about the current and future role of nuclear cardiology. Original articles address all aspects of nuclear cardiology, including interpretation, diagnosis, imaging equipment, and use of radiopharmaceuticals. As the official publication of the American Society of Nuclear Cardiology, the Journal also brings readers the latest information emerging from the Society's task forces and publishes guidelines and position papers as they are adopted.