Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology.

Swiss Medical Weekly (IF 2.1; Zone 4, Medicine; JCR Q2, Medicine, General & Internal). Pub Date: 2024-10-02. DOI: 10.57187/s.3547
Jessica Huwiler, Luca Oechslin, Patric Biaggi, Felix C Tanner, Christophe Alain Wyss
Swiss Medical Weekly, vol. 154, p. 3547. Publication type: Journal Article.

Abstract

Aims: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows.

Methods: For the study, a set of 88 multiple-choice cardiology exam questions was used. The participating cardiology fellows and selected chatbots were presented with these questions. The evaluation metrics included Top-1 and Top-2 accuracy, assessing the ability of chatbots and fellows to select the correct answer.
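
The two metrics named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' scoring procedure: it assumes each respondent supplies a ranked list of answer choices per question, and that Top-k counts a question as correct when the keyed answer appears among the first k choices. The function name `top_k_accuracy` and the toy answers are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_answers: Sequence[Sequence[str]],
    answer_key: Sequence[str],
    k: int,
) -> float:
    """Fraction of questions whose keyed answer is among the first k choices.

    Assumption: ranked_answers[i] lists a respondent's choices for question i,
    most preferred first. The abstract does not describe how second choices
    were elicited, so this data layout is illustrative only.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    if len(ranked_answers) != len(answer_key):
        raise ValueError("one ranked answer list is required per question")
    hits = sum(
        1
        for ranked, correct in zip(ranked_answers, answer_key)
        if correct in ranked[:k]
    )
    return hits / len(answer_key)


# Toy data (hypothetical, not taken from the study): four questions.
toy_key = ["A", "C", "B", "D"]
toy_ranked = [["A", "B"], ["B", "C"], ["D", "A"], ["D", "C"]]

top1 = top_k_accuracy(toy_ranked, toy_key, k=1)
top2 = top_k_accuracy(toy_ranked, toy_key, k=2)
```

Under this definition Top-2 accuracy can never be lower than Top-1 accuracy, which is consistent with the improvement reported in the Results.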

Results: Among the cardiology fellows, all 36 participants successfully passed the exam with a median accuracy of 98% (IQR 91-99%, range from 78% to 100%). However, the performance of the chatbots varied. Only one chatbot, Jasper quality, achieved the minimum pass rate of 73% correct answers. Most chatbots demonstrated a median Top-1 accuracy of 47% (IQR 44-53%, range from 42% to 73%), while Top-2 accuracy provided a modest improvement, resulting in a median accuracy of 67% (IQR 65-72%, range from 61% to 82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset.
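
The summary statistics in the Results (median, IQR, range, and the number of respondents reaching the pass rate) can be computed along these lines. This is a sketch under stated assumptions: the quartile method the authors used is not given in the abstract, so the "inclusive" method from Python's `statistics` module is an arbitrary choice, and the accuracies below are toy values, not study data. Only the 73% threshold comes from the abstract.

```python
import statistics
from typing import Sequence

# Minimum pass rate stated in the abstract.
PASS_THRESHOLD = 0.73


def summarise(accuracies: Sequence[float], threshold: float = PASS_THRESHOLD) -> dict:
    """Return median, IQR bounds, range, and pass count for a list of accuracies.

    Assumption: quartiles use the 'inclusive' method; the study may have used
    a different convention, which would shift the IQR bounds slightly.
    """
    if len(accuracies) < 2:
        raise ValueError("at least two accuracies are required")
    q1, _, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "median": statistics.median(accuracies),
        "iqr": (q1, q3),
        "range": (min(accuracies), max(accuracies)),
        "passed": sum(1 for a in accuracies if a >= threshold),
    }


# Toy accuracies (hypothetical, not taken from the study).
toy_accuracies = [0.40, 0.50, 0.60, 0.70, 0.80]
summary = summarise(toy_accuracies)
```

Whether the original pass decision compared exact or rounded percentages is not stated, so the `>=` comparison on unrounded values is also an assumption.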

Conclusions: Overall, the study suggests that most current language-based chatbots have limitations in accurately solving theoretical medical board exams. In general, currently widely available chatbots fell short of achieving a passing score in a theoretical cardiology board exam. Nevertheless, a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.
