Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4.

IF 3.0 · CAS Tier 2 (Medicine) · JCR Q1 (Emergency Medicine) · Scandinavian Journal of Trauma, Resuscitation & Emergency Medicine · Pub Date: 2024-09-26 · DOI: 10.1186/s13049-024-01266-2
Stefanie Beck, Manuel Kuhner, Markus Haar, Anne Daubmann, Martin Semmann, Stefan Kluge
{"title":"Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4.","authors":"Stefanie Beck, Manuel Kuhner, Markus Haar, Anne Daubmann, Martin Semmann, Stefan Kluge","doi":"10.1186/s13049-024-01266-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Aim of the study: </strong>Artificial intelligence (AI) chatbots are established as tools for answering medical questions worldwide. Healthcare trainees are increasingly using this cutting-edge technology, although its reliability and accuracy in the context of healthcare remain uncertain. This study evaluated the suitability of Chat-GPT versions 3.5 and 4 for healthcare professionals seeking up-to-date evidence and recommendations for resuscitation by comparing the key messages of the resuscitation guidelines, which methodically set the gold standard of current evidence and recommendations, with the statements of the AI chatbots on this topic.</p><p><strong>Methods: </strong> This prospective comparative content analysis was conducted between the 2021 European Resuscitation Council (ERC) guidelines and the responses of two freely available ChatGPT versions (ChatGPT-3.5 and the Bing version of the ChatGPT-4) to questions about the key messages of clinically relevant ERC guideline chapters for adults. (1) The content analysis was performed bidirectionally by independent raters. The completeness and actuality of the AI output were assessed by comparing the key message with the AI-generated statements. (2) The conformity of the AI output was evaluated by comparing the statements of the two ChatGPT versions with the content of the ERC guidelines.</p><p><strong>Results: </strong>In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32 statements. ChatGPT-3.5 did not address 123 key messages, and ChatGPT-4 did not address 132 of the 172 key messages of the ERC guideline chapters. A total of 77% of the ChatGPT-3.5 statements and 84% of the ChatGPT-4 statements were fully in line with the ERC guidelines. The main reason for nonconformity was superficial and incorrect AI statements. The interrater reliability between the two raters, measured by Cohen's kappa, was greater for ChatGPT-4 (0.56 for completeness and 0.76 for conformity analysis) than for ChatGPT-3.5 (0.48 for completeness and 0.36 for conformity).</p><p><strong>Conclusion: </strong>We advise healthcare professionals not to rely solely on the tested AI-based chatbots to keep up to date with the latest evidence, as the relevant texts for the task were not part of the training texts of the underlying LLMs, and the lack of conceptual understanding of AI carries a high risk of spreading misconceptions. 
Original publications should always be considered for comprehensive understanding.</p>","PeriodicalId":49292,"journal":{"name":"Scandinavian Journal of Trauma Resuscitation & Emergency Medicine","volume":null,"pages":null},"PeriodicalIF":3.0000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11425874/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scandinavian Journal of Trauma Resuscitation & Emergency Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s13049-024-01266-2","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EMERGENCY MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract

Aim of the study: Artificial intelligence (AI) chatbots are established worldwide as tools for answering medical questions. Healthcare trainees increasingly use this technology, although its reliability and accuracy in the healthcare context remain uncertain. This study evaluated the suitability of ChatGPT versions 3.5 and 4 for healthcare professionals seeking up-to-date evidence and recommendations on resuscitation by comparing the key messages of the resuscitation guidelines, which methodically set the gold standard of current evidence and recommendations, with the AI chatbots' statements on this topic.

Methods: This prospective comparative content analysis was conducted between the 2021 European Resuscitation Council (ERC) guidelines and the responses of two freely available ChatGPT versions (ChatGPT-3.5 and the Bing version of ChatGPT-4) to questions about the key messages of the clinically relevant ERC guideline chapters for adults. (1) The content analysis was performed bidirectionally by independent raters. The completeness and currency of the AI output were assessed by comparing the key messages of the guidelines with the AI-generated statements. (2) The conformity of the AI output was evaluated by comparing the statements of the two ChatGPT versions with the content of the ERC guidelines.
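The two comparison directions can be made concrete with a small tally. Below is a minimal Python sketch; all message IDs and verdicts are hypothetical placeholders, not the study's data. The real units of analysis were the guidelines' key messages (for completeness) and the AI-generated statements (for conformity).

# Direction 1 - completeness: was each guideline key message addressed
# anywhere in the chatbot's answer? (hypothetical verdicts)
key_message_addressed = {
    "km_01": True,
    "km_02": False,  # key message missing from the AI output
    "km_03": True,
}
completeness = sum(key_message_addressed.values()) / len(key_message_addressed)

# Direction 2 - conformity: does each AI-generated statement agree with
# the guideline content? (hypothetical verdicts)
ai_statement_conforms = {
    "st_01": True,
    "st_02": True,
    "st_03": False,  # superficial or incorrect statement
}
conformity = sum(ai_statement_conforms.values()) / len(ai_statement_conforms)

print(f"completeness: {completeness:.0%}, conformity: {conformity:.0%}")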

Results: In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32 statements. ChatGPT-3.5 did not address 123, and ChatGPT-4 did not address 132, of the 172 key messages of the ERC guideline chapters. A total of 77% of the ChatGPT-3.5 statements and 84% of the ChatGPT-4 statements were fully in line with the ERC guidelines; the main reason for nonconformity was superficial or incorrect AI statements. Interrater reliability between the two raters, measured by Cohen's kappa, was higher for ChatGPT-4 (0.56 for the completeness and 0.76 for the conformity analysis) than for ChatGPT-3.5 (0.48 and 0.36, respectively).
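For orientation, the reported counts imply that ChatGPT-3.5 addressed 49 of the 172 key messages (28%) and ChatGPT-4 addressed 40 (23%). Cohen's kappa corrects the raters' raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement derived from each rater's label frequencies. A minimal Python sketch with hypothetical rating vectors follows (the study reports only the resulting kappa values, not the underlying ratings):

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # kappa = (p_o - p_e) / (1 - p_e)
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a | freq_b) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical conformity verdicts from two independent raters:
rater_1 = ["conform", "conform", "partly", "nonconform", "conform", "partly"]
rater_2 = ["conform", "conform", "partly", "conform", "conform", "nonconform"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.43 for this toy data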

Conclusion: We advise healthcare professionals not to rely solely on the tested AI-based chatbots to keep up to date with the latest evidence, as the relevant guideline texts were not part of the training data of the underlying LLMs, and the models' lack of conceptual understanding carries a high risk of spreading misconceptions. Original publications should always be consulted for a comprehensive understanding.

Journal information:
CiteScore: 6.10
Self-citation rate: 6.10%
Articles published: 57
Review time: 6-12 weeks
Journal scope: The primary topics of interest in Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine (SJTREM) are the pre-hospital and early in-hospital diagnostic and therapeutic aspects of emergency medicine, trauma, and resuscitation. Contributions focusing on dispatch, major incidents, etiology, pathophysiology, rehabilitation, epidemiology, prevention, education, training, implementation, work environment, as well as ethical and socio-economic aspects may also be assessed for publication.
Latest articles in this journal:
Racing against time: Emergency ambulance dispatches and response times, a register-based study in Region Zealand, Denmark, 2013-2022.
Low-energy, high risk: unveiling the undertriage crisis in geriatric trauma.
Pre-hospital care for children: a descriptive study from Central Norway.
Outcomes of odontoid fractures with associated cardiac arrest: retrospective bi-center case series and systematic literature review.
Simulating the methodological bias in the ATLS classification of hypovolemic shock: a critical reappraisal of the base deficit renaissance.