The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.

IF 2.1 · Medicine, CAS Tier 3 · Q2 Emergency Medicine · Prehospital Emergency Care · Pub Date: 2024-07-22 · DOI: 10.1080/10903127.2024.2376757
Graham Brant-Zawadzki, Brent Klapthor, Chris Ryba, Drew C Youngquist, Brooke Burton, Helen Palatinus, Scott T Youngquist
{"title":"The Performance of ChatGPT-4 and Gemini Ultra 1.0 for Quality Assurance Review in Emergency Medical Services Chest Pain Calls.","authors":"Graham Brant-Zawadzki, Brent Klapthor, Chris Ryba, Drew C Youngquist, Brooke Burton, Helen Palatinus, Scott T Youngquist","doi":"10.1080/10903127.2024.2376757","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs), for Emergency Medical Services (EMS) quality assurance. The implementation of these LLMs for EMS quality assurance has the potential to significantly reduce the workload on medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This offers the potential for more efficient and accurate identification of areas requiring improvement, thereby potentially enhancing patient care outcomes.</p><p><strong>Methods: </strong>Two expert human reviewers, ChatGPT GPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled and anonymized prehospital records from 2 large urban EMS agencies for adherence to 2020 National Association of State EMS metrics for cardiac care. We evaluated the accuracy of scoring, inter-rater reliability, and review efficiency. The inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.</p><p><strong>Results: </strong>Human reviewers showed high interrater reliability, with 91.2% agreement and a kappa coefficient 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers in EKG documentation and aspirin administration (76.2% agreement, kappa coefficient 0.401 (0.334-0.468), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 01:28 min (IQR 1:12 - 1:51 min) per human chart review, 01:24 min (IQR 01:09 - 01:53 min) per ChatGPT-4 chart review (<i>p</i> = 0.46), and 01:50 min (IQR 01:10-03:34 min) per Gemini Ultra review (<i>p</i> = 0.06).</p><p><strong>Conclusions: </strong>Large language models demonstrate potential in supporting quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to human evaluators. Our findings suggest that current LLMs may best offer supplemental support to the human review processes, but their current value remains limited. Enhancements in LLM training and integration are recommended for improved and more reliable performance in the quality assurance processes.</p>","PeriodicalId":20336,"journal":{"name":"Prehospital Emergency Care","volume":null,"pages":null},"PeriodicalIF":2.1000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Prehospital Emergency Care","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/10903127.2024.2376757","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"EMERGENCY MEDICINE","Score":null,"Total":0}
引用次数: 0

Abstract

Objectives: This study assesses the feasibility, inter-rater reliability, and accuracy of using OpenAI's ChatGPT-4 and Google's Gemini Ultra large language models (LLMs) for Emergency Medical Services (EMS) quality assurance. Implementing these LLMs for EMS quality assurance has the potential to significantly reduce the workload of medical directors and quality assurance staff by automating aspects of the processing and review of patient care reports. This could enable more efficient and accurate identification of areas requiring improvement, thereby enhancing patient care outcomes.

Methods: Two expert human reviewers, ChatGPT-4, and Gemini Ultra assessed and rated 150 consecutively sampled, anonymized prehospital records from 2 large urban EMS agencies for adherence to the 2020 National Association of State EMS Officials metrics for cardiac care. We evaluated scoring accuracy, inter-rater reliability, and review efficiency. Inter-rater reliability for the dichotomous outcome of each EMS metric was measured using the kappa statistic.
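The kappa statistic corrects raw percent agreement for the agreement expected by chance alone. As a brief illustration, the following Python sketch computes Cohen's kappa for a single dichotomous metric; the ratings are hypothetical and not taken from the study data.

```python
# Minimal sketch: Cohen's kappa for one dichotomous EMS metric
# (e.g., "aspirin administration documented: yes/no").
# All ratings below are hypothetical illustration data, not study data.

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of charts where the two raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rate of "1" ratings
    p_a1, p_b1 = sum(rater_a) / n, sum(rater_b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

human = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 1 = metric met, 0 = not met
llm   = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(f"kappa = {cohen_kappa(human, llm):.3f}")
```

Percent agreement alone can look high when one rating dominates; kappa discounts that chance component, which is why the study reports both figures.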

Results: Human reviewers showed high inter-rater reliability, with 91.2% agreement and a kappa coefficient of 0.782 (0.654-0.910). ChatGPT-4 achieved substantial agreement with human reviewers on EKG documentation and aspirin administration (76.2% agreement; kappa coefficient 0.401 (0.334-0.468)), but performance varied across other metrics. Gemini Ultra's evaluation was discontinued due to poor performance. No significant differences were observed in median review times: 1:28 min (IQR 1:12-1:51) per human chart review, 1:24 min (IQR 1:09-1:53) per ChatGPT-4 chart review (p = 0.46), and 1:50 min (IQR 1:10-3:34) per Gemini Ultra review (p = 0.06).
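The abstract does not name the test behind these p-values; a Mann-Whitney U test is one standard nonparametric choice for comparing skewed per-chart review times, sketched below with invented timing data.

```python
# Hypothetical sketch: comparing per-chart review times (seconds) between
# human and LLM reviewers with a Mann-Whitney U test. The paper does not
# state which test it used, and these times are invented, not study data.
from scipy.stats import mannwhitneyu

human_secs = [88, 72, 111, 95, 80, 101, 90, 76]
llm_secs   = [84, 69, 113, 78, 92, 70, 99, 85]

stat, p = mannwhitneyu(human_secs, llm_secs, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```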

Conclusions: Large language models demonstrate potential to support quality assurance by effectively and objectively extracting data elements. However, their accuracy in interpreting non-standardized and time-sensitive details remains inferior to that of human evaluators. Our findings suggest that current LLMs are best suited to supplementing human review processes, and their present value remains limited. Enhancements in LLM training and integration are recommended to achieve more reliable performance in quality assurance processes.

Source Journal: Prehospital Emergency Care (Medicine - Public, Environmental & Occupational Health)
CiteScore: 4.30
Self-citation rate: 12.50%
Articles per year: 137
Time to review: 1 month
Journal Description: Prehospital Emergency Care publishes peer-reviewed information relevant to the practice, educational advancement, and investigation of prehospital emergency care, including the following types of articles: Special Contributions, Original Articles, Education and Practice, Preliminary Reports, Case Conferences, Position Papers, Collective Reviews, Editorials, Letters to the Editor, and Media Reviews.
Latest Articles in This Journal
- Clinical Judgment Item Development for Emergency Medical Service Clinicians.
- 2024 Systematic Review of Evidence-Based Guidelines for Prehospital Care.
- Proportional Versus Fixed Chest Compression Depth for Guideline-Compliant Resuscitation of Infant Asphyxial Cardiac Arrest.
- The Route to ROSC: Evaluating the Impact of Route and Timing of Epinephrine Administration in Out-of-Hospital Cardiac Arrest Outcomes.
- Evaluation of the Implementation of a Novel Fluid Resuscitation Device in the Prehospital Care of Sepsis Patients: Application of the Implementation Outcomes Framework.