Suitability of GPT-4o as an evaluator of cardiopulmonary resuscitation skills examinations

IF 6.5 | Zone 1 (Medicine) | Q1 Critical Care Medicine | Resuscitation | Pub Date: 2024-11-01 | DOI: 10.1016/j.resuscitation.2024.110404
Lu Wang, Yuqiang Mao, Lin Wang, Yujie Sun, Jiangdian Song, Yang Zhang
Resuscitation, Volume 204, Article 110404. Journal Article. https://www.sciencedirect.com/science/article/pii/S0300957224002983
Citations: 0

Abstract

Aim

To assess the accuracy and reliability of GPT-4o for scoring examinees’ performance on cardiopulmonary resuscitation (CPR) skills tests.

Methods

This study included six experts certified to supervise the national medical licensing examination (three junior and three senior), who reviewed the CPR skills test videos of 103 examinees. All videos reviewed by the experts were also assessed automatically by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o’s reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o’s accuracy was evaluated as the agreement between its scores and those of the experts, using the intraclass correlation coefficient (for the first three sections) and Fleiss’ Kappa (for the last section).
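The two agreement statistics named above can be sketched as follows. This is an illustration only, not the authors' analysis code: the abstract does not say which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form ICC(2,1) is an assumption, and the function names and toy inputs are hypothetical.

```python
from typing import Sequence


def icc2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the score given to examinee i by rater j.
    The choice of this ICC form is an assumption, not taken from the paper.
    """
    n = len(scores)       # number of examinees (rows)
    k = len(scores[0])    # number of raters (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for categorical ratings.

    counts[i][j] is the number of raters who put examinee i in category j;
    every row must sum to the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Mean observed agreement across examinees.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects

    # Agreement expected by chance from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy data, invented for illustration: 3 examinees scored by 2 raters.
    toy_scores = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
    print("ICC(2,1):", icc2_1(toy_scores))

    # Toy data: 2 examinees, 3 raters, 2 categories (e.g. pass / fail).
    toy_counts = [[2, 1], [1, 2]]
    print("Fleiss' Kappa:", fleiss_kappa(toy_counts))
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently one point higher than another lowers the coefficient even though the ranking of examinees is identical. Fleiss' Kappa fits the last section only if that section is scored categorically, which the abstract implies but does not state.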

Results

The mean accuracy scores (agreement coefficients) for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when comparing GPT-4o’s scores with those of the junior experts, and 0.75, 0.65, 0.72, and 0.41, respectively, when comparing GPT-4o’s scores with those of the senior experts. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66–4.33, mean [standard deviation] = 3.95 [0.55]) and 4.33 (4.00–4.67, 4.29 [0.50]) for the junior and senior experts, respectively.
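The reliability figures above follow the common "median (IQR), mean [SD]" reporting pattern. A minimal sketch of how such a summary could be computed is given below; the function name `summarize_likert` and the sample ratings are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption, since the abstract does not state which one was used.

```python
import statistics
from typing import Dict, Sequence, Tuple, Union


def summarize_likert(
    ratings: Sequence[float],
) -> Dict[str, Union[float, Tuple[float, float]]]:
    """Return median, IQR bounds, mean, and sample SD for a list of ratings."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    return {
        "median": median,
        "iqr": (q1, q3),
        "mean": statistics.fmean(ratings),
        "sd": statistics.stdev(ratings),  # sample SD (n - 1 denominator)
    }


if __name__ == "__main__":
    # Invented ratings on a 1-5 scale, for illustration only.
    toy_ratings = [3.0, 4.0, 4.0, 5.0, 4.0, 3.0, 5.0]
    summary = summarize_likert(toy_ratings)
    print(
        f"median {summary['median']:.2f} "
        f"(IQR {summary['iqr'][0]:.2f}-{summary['iqr'][1]:.2f}), "
        f"mean [SD] {summary['mean']:.2f} [{summary['sd']:.2f}]"
    )
```

Different quartile conventions can shift the IQR bounds slightly on small samples, so a reproduction should match whatever convention the original statistical software used.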

Conclusions

GPT-4o demonstrated a level of accuracy similar to that of senior experts in scoring CPR skills examination videos. The results demonstrate the potential for deploying this large language model in medical examination settings.
Source journal
Resuscitation (Medicine: Emergency Medicine)
CiteScore: 12.00
Self-citation rate: 18.50%
Articles published: 556
Review time: 21 days
About the journal: Resuscitation is a monthly international and interdisciplinary medical journal. The papers published deal with the aetiology, pathophysiology and prevention of cardiac arrest, resuscitation training, clinical resuscitation, and experimental resuscitation research, although papers relating to animal studies will be published only if they are of exceptional interest and related directly to clinical cardiopulmonary resuscitation. Papers relating to trauma are published occasionally but the majority of these concern traumatic cardiac arrest.