An Exploratory Analysis of ChatGPT Compared to Human Performance With the Anesthesiology Oral Board Examination: Initial Insights and Implications.

Samuel N Blacker, Fei Chen, Daniel Winecoff, Benjamin L Antonio, Harendra Arora, Bryan J Hierlmeier, Rachel M Kacmar, Anthony N Passannante, Anthony R Plunkett, David Zvara, Benjamin Cobb, Alexander Doyal, Daniel Rosenkrans, Kenneth Bradbury Brown, Michael A Gonzalez, Courtney Hood, Tiffany T Pham, Abhijit V Lele, Lesley Hall, Ameer Ali, Robert S Isaak
Anesthesia & Analgesia · Published 2024-09-13 · DOI: https://doi.org/10.1213/ane.0000000000006875

Abstract

BACKGROUND: Chat Generative Pre-Trained Transformer (ChatGPT) has been tested on, and has passed, various high-level examinations. However, it has not been tested on an examination such as the American Board of Anesthesiology (ABA) Standardized Oral Examination (SOE). The SOE is designed to assess higher-level competencies, such as judgment, organization, adaptability to unexpected clinical changes, and presentation of information.

METHODS: Four anesthesiology fellows were examined on 2 sample ABA SOEs, and their answers were compared to ChatGPT's responses to the same questions. All human and ChatGPT responses were transcribed, randomized by module, and then reproduced as complete examinations using a commercially available software-based human voice replicator. Eight ABA applied examiners listened to and scored the topics and modules from 1 of the 4 versions of each of the 2 sample examinations. The ABA did not provide support to, or collaborate with, any of the authors.

RESULTS: The anesthesiology fellows' answers had a better median module topic score than ChatGPT's (P = .03). However, there was no significant difference in median overall global module scores between the human and ChatGPT responses (P = .17). The examiners identified the ChatGPT-generated answers in 23 of 24 modules (95.83%); only 1 ChatGPT response was perceived as coming from a human. In contrast, the examiners judged the human (fellow) responses to be artificial intelligence (AI)-generated in 10 of 24 modules (41.67%). Examiner comments noted that ChatGPT generated relevant content but produced lengthy answers that at times did not focus on the specific scenario priorities. The examiners made no comments about ChatGPT fact "hallucinations."

CONCLUSIONS: ChatGPT generated SOE answers with module ratings comparable to those of anesthesiology fellows, as graded by 8 ABA oral board examiners. However, the ChatGPT answers were deemed subjectively inferior because of their length and lack of focus. Future curation and training of an AI database, like ChatGPT, could produce answers more in line with ideal ABA SOE answers. This could lead to higher performance and an anesthesiology-specific trained AI useful for training and examination preparation.
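The identification rates reported in the results can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts stated in the abstract (23 of 24 ChatGPT responses identified; 10 of 24 fellow responses misattributed to AI):

```python
# Counts taken directly from the abstract.
TOTAL_MODULES = 24
chatgpt_identified = 23   # ChatGPT answers correctly flagged as AI-generated
fellows_misattributed = 10  # human (fellow) answers judged to be AI-generated

chatgpt_rate = round(chatgpt_identified / TOTAL_MODULES * 100, 2)
fellow_rate = round(fellows_misattributed / TOTAL_MODULES * 100, 2)

print(chatgpt_rate)  # 95.83, matching the reported percentage
print(fellow_rate)   # 41.67, matching the reported percentage
```

The asymmetry between the two rates (95.83% vs. 41.67%) is what supports the abstract's point: examiners could reliably spot ChatGPT, but were far less certain that human answers were human.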