Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4.

Scandinavian Journal of Trauma Resuscitation & Emergency Medicine (IF 3.0; Zone 2, Medicine; JCR Q1, Emergency Medicine) Pub Date: 2024-09-26 DOI: 10.1186/s13049-024-01266-2

Stefanie Beck, Manuel Kuhner, Markus Haar, Anne Daubmann, Martin Semmann, Stefan Kluge

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11425874/pdf/

Citations: 0

Abstract

Aim of the study: Artificial intelligence (AI) chatbots are established as tools for answering medical questions worldwide. Healthcare trainees are increasingly using this cutting-edge technology, although its reliability and accuracy in the context of healthcare remain uncertain. This study evaluated the suitability of ChatGPT versions 3.5 and 4 for healthcare professionals seeking up-to-date evidence and recommendations for resuscitation by comparing the key messages of the resuscitation guidelines, which methodically set the gold standard of current evidence and recommendations, with the statements of the AI chatbots on this topic.

Methods: This prospective comparative content analysis was conducted between the 2021 European Resuscitation Council (ERC) guidelines and the responses of two freely available ChatGPT versions (ChatGPT-3.5 and the Bing version of ChatGPT-4) to questions about the key messages of clinically relevant ERC guideline chapters for adults. (1) The content analysis was performed bidirectionally by independent raters. The completeness and actuality of the AI output were assessed by comparing the key messages with the AI-generated statements. (2) The conformity of the AI output was evaluated by comparing the statements of the two ChatGPT versions with the content of the ERC guidelines.

Results: In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32 statements. ChatGPT-3.5 did not address 123, and ChatGPT-4 did not address 132, of the 172 key messages of the ERC guideline chapters. A total of 77% of the ChatGPT-3.5 statements and 84% of the ChatGPT-4 statements were fully in line with the ERC guidelines. The main reasons for nonconformity were superficial and incorrect AI statements. The interrater reliability between the two raters, measured by Cohen's kappa, was greater for ChatGPT-4 (0.56 for completeness and 0.76 for conformity analysis) than for ChatGPT-3.5 (0.48 for completeness and 0.36 for conformity).
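The figures in the Results can be illustrated with a minimal Python sketch. It assumes that both "not addressed" counts refer to the same 172 key messages, and it uses the standard unweighted form of Cohen's kappa; the abstract does not state which variant or software the authors used. The function names and the example ratings are hypothetical and are not study data.

```python
from collections import Counter

# Counts taken from the Results; the shared denominator of 172 is an assumption.
TOTAL_KEY_MESSAGES = 172
NOT_ADDRESSED = {"ChatGPT-3.5": 123, "ChatGPT-4": 132}


def coverage(not_addressed, total=TOTAL_KEY_MESSAGES):
    """Return (number of key messages addressed, share of all key messages)."""
    addressed = total - not_addressed
    return addressed, addressed / total


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


for model, missed in NOT_ADDRESSED.items():
    count, share = coverage(missed)
    print(f"{model}: {count}/{TOTAL_KEY_MESSAGES} key messages addressed ({share:.1%})")

# Hypothetical ratings, invented for illustration only (not study data).
rater_1 = ["conform", "conform", "nonconform", "conform", "nonconform", "conform"]
rater_2 = ["conform", "nonconform", "nonconform", "conform", "nonconform", "conform"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

Under that assumption, ChatGPT-3.5 addressed 49 of the 172 key messages (28.5%) and ChatGPT-4 addressed 40 (23.3%). Kappa corrects raw agreement for the agreement expected by chance, which is why values such as 0.36 can arise even when the two raters agree on most items.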

Conclusion: We advise healthcare professionals not to rely solely on the tested AI-based chatbots to keep up to date with the latest evidence, as the relevant texts for the task were not part of the training texts of the underlying LLMs, and the lack of conceptual understanding of AI carries a high risk of spreading misconceptions. Original publications should always be considered for comprehensive understanding.

Source journal

Scandinavian Journal of Trauma Resuscitation & Emergency Medicine (category: Emergency Medicine)

CiteScore: 6.10

Self-citation rate: 6.10%

Articles published:

Review time: 6-12 weeks

Journal description: The primary topics of interest in Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine (SJTREM) are the pre-hospital and early in-hospital diagnostic and therapeutic aspects of emergency medicine, trauma, and resuscitation. Contributions focusing on dispatch, major incidents, etiology, pathophysiology, rehabilitation, epidemiology, prevention, education, training, implementation, work environment, as well as ethical and socio-economic aspects may also be assessed for publication.