Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study.

JMIR Formative Research · Impact Factor 2.0 · Q3, Health Care Sciences & Services · Publication Date: 2025-02-27 · DOI: 10.2196/66478
Riley Scherr, Aidin Spina, Allen Dao, Saman Andalib, Faris F Halaseh, Sarah Blair, Warren Wiechmann, Ronald Rivera

Abstract

Background: Case studies have shown ChatGPT can run clinical simulations at the medical student level. However, no data have assessed ChatGPT's reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms.

Objective: This study aims to quantify ChatGPT's ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology.

Methods: Using ChatGPT-4 and a prevalidated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. A total of 180 simulations were given correct answers and 180 simulations were given incorrect answers. ChatGPT was evaluated for its ability to adhere to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with χ² analyses using 95% CIs for odds ratios.

Results: In total, 100% (n=360) of simulations met basic simulation parameters and were medically accurate. For the advanced parameters, 55% (200/360) of all simulations delayed feedback, with the Correct arm delaying feedback significantly more often than the Incorrect arm (157/180, 87% vs 43/180, 24%; P<.001). A total of 79% (285/360) of simulations concluded autonomously, with no difference between the Correct and Incorrect arms (146/180, 81% vs 139/180, 77%; P=.36). Overall, 78% (282/360) of simulations gave comprehensive feedback, again with no difference between arms (137/180, 76% vs 145/180, 81%; P=.31). When feedback was delayed, ChatGPT-4 was not significantly more likely to conclude simulations autonomously (P=.34) or to provide comprehensive feedback (P=.27) than when feedback was not delayed.
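As an illustrative sketch (not the authors' analysis code), the reported delayed-feedback comparison can be reproduced from the counts in the abstract using a Pearson χ² test on the 2×2 table and a Woolf (log-scale) 95% CI for the odds ratio; the function names are ours, and only the Python standard library is assumed:

```python
import math

# Counts from the abstract: delayed vs. not-delayed feedback in each arm.
correct_delayed, correct_not = 157, 180 - 157      # Correct arm: 157/180 (87%)
incorrect_delayed, incorrect_not = 43, 180 - 43    # Incorrect arm: 43/180 (24%)

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi2_2x2(correct_delayed, correct_not, incorrect_delayed, incorrect_not)

# Odds ratio with a Woolf (log-scale) 95% CI.
odds_ratio = (correct_delayed * incorrect_not) / (correct_not * incorrect_delayed)
se_log_or = math.sqrt(1 / correct_delayed + 1 / correct_not
                      + 1 / incorrect_delayed + 1 / incorrect_not)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"chi2 = {chi2:.1f}")   # far above 10.83 (df=1 critical value), so P < .001
print(f"OR = {odds_ratio:.1f} (95% CI {ci_low:.1f}-{ci_high:.1f})")
```

The χ² statistic here is well above the df=1 critical value of 10.83, consistent with the abstract's P<.001, and the CI for the odds ratio excludes 1.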

Conclusions: ChatGPT-4 simulations have the potential to be a reliable educational tool for simple clinical scenarios and can be evaluated with a novel 9-part metric. Per this metric, ChatGPT simulations performed perfectly on medical accuracy and basic simulation parameters, and performed well on comprehensive feedback and autonomous conclusion. Delayed feedback depended on the accuracy of user inputs. A simulation meeting one advanced parameter was not more likely to meet all advanced parameters. Further work must be done to ensure consistent performance across a broader range of simulation scenarios.
