Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors.

BMJ Open Quality (IF 1.3; JCR Q4, Health Care Sciences & Services). Pub Date: 2024-06-03. DOI: 10.1136/bmjoq-2023-002654

Yukinori Harada, Tomoharu Suzuki, Taku Harada, Tetsu Sakamoto, Kosuke Ishizuka, Taiju Miyagami, Ren Kawamura, Kotaro Kunitomo, Hiroyuki Nagano, Taro Shimizu, Takashi Watari

Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149143/pdf/


Abstract

Background: Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, this requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text based on suitable prompts. Therefore, ChatGPT can assist manual chart reviews in detecting diagnostic errors.

Objective: This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations.

Methods: We analysed 545 published case reports that included diagnostic errors. We entered the text of each case presentation and its final diagnosis, together with original prompts, into ChatGPT (GPT-4) to generate responses that included a judgement on whether a diagnostic error had occurred and the factors contributing to it. Contributing factors were coded according to three taxonomies: Diagnostic Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). ChatGPT's coding of contributing factors was compared with that of physicians.

Results: ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more contributing factors per case than physicians did: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The contributing factors most frequently coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC and 'atypical presentation' (264, 48.4%) in GDP.
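
The proportions quoted in the Results can be re-derived from the raw counts. The following is a minimal Python sketch of that arithmetic; the function names are illustrative, and the Wilson score interval is an added illustration of uncertainty around 519/545 that the abstract itself does not report.

```python
import math

N_CASES = 545  # case reports analysed, as stated in the abstract


def percent(count, total=N_CASES):
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


def wilson_interval(count, total=N_CASES, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    Not reported in the abstract; shown only as an illustration.
    Returns (lower, upper) as fractions between 0 and 1.
    """
    p = count / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts taken directly from the Results paragraph.
reported = {
    "errors detected": 519,
    "DEER: failure/delay in considering the diagnosis": 315,
    "RDC: atypical presentation": 365,
    "GDP: atypical presentation": 264,
}

for label, count in reported.items():
    print(f"{label}: {count}/{N_CASES} = {percent(count):.1f}%")

low, high = wilson_interval(519)
print(f"Wilson 95% interval for detection: {100 * low:.1f}% to {100 * high:.1f}%")
```

Rounded to one decimal place, the three contributing-factor percentages match the reported 57.8%, 67.0% and 48.4%, and the detection rate is consistent with the reported 95%.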

Conclusion: ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.


