Suitability of GPT-4o as an evaluator of cardiopulmonary resuscitation skills examinations

IF 4.6; Zone 1 (Medicine); JCR Q1, Critical Care Medicine; Resuscitation; Pub Date: 2024-11-01; Epub Date: 2024-09-28; DOI: 10.1016/j.resuscitation.2024.110404

Lu Wang, Yuqiang Mao, Lin Wang, Yujie Sun, Jiangdian Song, Yang Zhang

Resuscitation, volume 204, Article 110404. Full text: https://www.sciencedirect.com/science/article/pii/S0300957224002983

Abstract

Aim

To assess the accuracy and reliability of GPT-4o for scoring examinees’ performance on cardiopulmonary resuscitation (CPR) skills tests.

Methods

Six experts certified to supervise the national medical licensing examination (three junior and three senior) reviewed CPR skills test videos from 103 examinees. Every video reviewed by the experts was also assessed automatically by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o’s reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o’s accuracy was evaluated as the agreement between its scores and those of the experts, measured with the intraclass correlation coefficient (for the first three sections) and Fleiss’ Kappa (for the last section).
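The two agreement statistics named above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the abstract does not say which ICC form was used, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), and the function names `icc2_1` and `fleiss_kappa` as well as all scores and labels in the demo are hypothetical.

```python
from collections import Counter


def icc2_1(scores):
    """ICC(2,1), an assumed form: two-way random effects, absolute agreement.

    scores: one row per examinee, one numeric score per rater.
    """
    n = len(scores)       # number of examinees
    k = len(scores[0])    # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: examinees, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for categorical labels.

    ratings: one row per examinee, one label per rater (same rater count per row).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every examinee needs the same number of ratings")

    # counts[i][j] = number of raters who put examinee i in category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects      # observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    if p_e == 1.0:
        return float("nan")            # undefined when only one category is used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: columns are (expert, GPT-4o) for four examinees.
section_scores = [[8, 7], [5, 6], [9, 9], [4, 5]]
repeat_labels = [["pass", "pass"], ["fail", "pass"], ["pass", "pass"], ["fail", "fail"]]
print("ICC(2,1):", round(icc2_1(section_scores), 3))
print("Fleiss' kappa:", round(fleiss_kappa(repeat_labels, ["pass", "fail"]), 3))
```

The ICC suits the numerically scored sections, while Fleiss' kappa suits a section recorded as categories, which is consistent with the split described in the Methods.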

Results

The mean accuracy scores for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when GPT-4o’s scores were compared with the junior experts’ scores, and 0.75, 0.65, 0.72, and 0.41, respectively, when they were compared with the senior experts’ scores. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66–4.33, mean [standard deviation] = 3.95 [0.55]) for the junior experts and 4.33 (IQR = 4.00–4.67, mean [standard deviation] = 4.29 [0.50]) for the senior experts.
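The reliability summary reported above (median, IQR, mean, standard deviation) could be computed along these lines. This is a sketch under assumptions: the abstract does not state which quartile convention was used, so the "inclusive" method is chosen here only for illustration, and `summarize_likert` and the ratings are hypothetical.

```python
import statistics


def summarize_likert(ratings):
    """Return median, IQR bounds, mean, and sample SD for Likert ratings."""
    # Quartile convention is an assumption; other methods give slightly different bounds.
    q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return {
        "median": statistics.median(ratings),
        "iqr": (q1, q3),
        "mean": statistics.mean(ratings),
        "sd": statistics.stdev(ratings),
    }


# Hypothetical reliability ratings on the 1-5 scale.
example_ratings = [3.67, 4.0, 4.33, 4.0, 4.67, 3.33]
print(summarize_likert(example_ratings))
```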

Conclusions

In scoring CPR skills examination videos, GPT-4o demonstrated a level of accuracy similar to that of senior experts. The results demonstrate the potential for deploying this large language model in medical examination settings.

Source journal

Resuscitation (Medicine: Emergency Medicine)

CiteScore: 12.00

Self-citation rate: 18.50%

Articles published: 556

Review time: 21 days

Journal description: Resuscitation is a monthly international and interdisciplinary medical journal. The papers published deal with the aetiology, pathophysiology and prevention of cardiac arrest, resuscitation training, clinical resuscitation, and experimental resuscitation research, although papers relating to animal studies will be published only if they are of exceptional interest and related directly to clinical cardiopulmonary resuscitation. Papers relating to trauma are published occasionally, but the majority of these concern traumatic cardiac arrest.