Experimental assessment of the performance of artificial intelligence in solving multiple-choice board exams in cardiology.

Swiss Medical Weekly, 2024;154:3547. Pub Date: 2024-10-02. DOI: 10.57187/s.3547. IF 2.1; Zone 4 (Medicine); JCR Q2 (Medicine, General & Internal).

Jessica Huwiler, Luca Oechslin, Patric Biaggi, Felix C Tanner, Christophe Alain Wyss



Abstract

Aims: The aim of the present study was to evaluate the performance of various artificial intelligence (AI)-powered chatbots (commercially available in Switzerland up to June 2023) in solving a theoretical cardiology board exam and to compare their accuracy with that of human cardiology fellows.

Methods: For the study, a set of 88 multiple-choice cardiology exam questions was used. The participating cardiology fellows and selected chatbots were presented with these questions. The evaluation metrics included Top-1 and Top-2 accuracy, assessing the ability of chatbots and fellows to select the correct answer.
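The two metrics can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' scoring code: it assumes that Top-1 accuracy counts a question as correct when the first-choice answer matches the key, and that Top-2 accuracy counts it as correct when the key appears among the first two choices. The function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(ranked_choices: Sequence[Sequence[str]],
                   answer_key: Sequence[str],
                   k: int) -> float:
    """Fraction of questions whose correct option is among the first k choices."""
    if len(ranked_choices) != len(answer_key):
        raise ValueError("one ranked choice list is needed per question")
    if not answer_key:
        raise ValueError("at least one question is required")
    hits = sum(correct in choices[:k]
               for choices, correct in zip(ranked_choices, answer_key))
    return hits / len(answer_key)


# Made-up toy data (not from the study): four questions, options A-E.
toy_key = ["A", "C", "B", "E"]
toy_ranked = [["A", "B"], ["B", "C"], ["D", "A"], ["E", "A"]]

top1 = top_k_accuracy(toy_ranked, toy_key, k=1)  # 2 of 4 correct
top2 = top_k_accuracy(toy_ranked, toy_key, k=2)  # 3 of 4 correct
print(f"Top-1: {top1:.0%}, Top-2: {top2:.0%}")
```

Because the first choice is always among the first two, Top-2 accuracy under this definition can never be lower than Top-1 accuracy.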

Results: All 36 participating cardiology fellows passed the exam, with a median accuracy of 98% (IQR 91-99%, range 78-100%). The performance of the chatbots varied. Only one chatbot, Jasper quality, achieved the minimum pass rate of 73% correct answers. Across the chatbots, the median Top-1 accuracy was 47% (IQR 44-53%, range 42-73%); Top-2 accuracy provided a modest improvement, with a median of 67% (IQR 65-72%, range 61-82%). Even with this advantage, only two chatbots, Jasper quality and ChatGPT plus 4.0, would have passed the exam. Similar results were observed when picture-based questions were excluded from the dataset.
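Summary statistics of this kind (median, IQR, and the number of test-takers reaching the 73% pass rate) could be computed along the following lines. This is a hedged sketch only: the abstract does not state which quartile method was used, so the "inclusive" method of Python's `statistics.quantiles` is an assumption, and the function name `summarize` and the accuracy values are hypothetical.

```python
import statistics


def summarize(accuracies, pass_rate=0.73):
    """Return the median, the quartiles (Q1, Q3), and the count of scores at or above pass_rate."""
    q1, _, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "median": statistics.median(accuracies),
        "iqr": (q1, q3),
        "n_passed": sum(a >= pass_rate for a in accuracies),
    }


# Made-up accuracies for five hypothetical test-takers (not study data).
toy_accuracies = [0.40, 0.50, 0.60, 0.70, 0.80]
summary = summarize(toy_accuracies)
print(summary)
```

The pass rate is passed as a parameter so the same summary can be rerun on Top-1 and Top-2 accuracies, or on a subset of questions such as those without pictures.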

Conclusions: Overall, the study suggests that most current language-based chatbots have limitations in accurately solving theoretical medical board exams. In general, currently widely available chatbots fell short of achieving a passing score in a theoretical cardiology board exam. Nevertheless, a few showed promising results. Further improvements in artificial intelligence language models may lead to better performance in medical knowledge applications in the future.




Source journal

Swiss Medical Weekly (Medicine: General & Internal)

CiteScore: 5.00

Self-citation rate: 0.00%

Review time: 3-8 weeks

About the journal: The Swiss Medical Weekly accepts for consideration original and review articles from all fields of medicine. The quality of SMW publications is guaranteed by a consistent policy of rigorous single-blind peer review. All editorial decisions are made by research-active academics.