Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?

Diagnosis (IF 2.2, Q2, MEDICINE, GENERAL & INTERNAL). Pub Date: 2024-03-12; eCollection Date: 2024-08-01. DOI: 10.1515/dx-2024-0027
Kazuya Mizuta, Takanobu Hirosawa, Yukinori Harada, Taro Shimizu
{"title":"Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?","authors":"Kazuya Mizuta, Takanobu Hirosawa, Yukinori Harada, Taro Shimizu","doi":"10.1515/dx-2024-0027","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. While there has been significant emphasis on creating lists of differential diagnoses, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in these lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating lists of differential diagnosis compared to medical professionals' assessments.</p><p><strong>Methods: </strong>We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists created by physicians, ChatGPT-3, and ChatGPT-4, using clinical vignettes. Eighty-two clinical vignettes were used, comprising 52 complex case reports published by the authors from the department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in the top 10 differential diagnosis lists using the kappa coefficient.</p><p><strong>Results: </strong>Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement.</p><p><strong>Conclusions: </strong>ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis should be included in the differential diagnosis lists.</p>","PeriodicalId":11273,"journal":{"name":"Diagnosis","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnosis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1515/dx-2024-0027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. While there has been significant emphasis on creating lists of differential diagnoses, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in these lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating differential diagnosis lists compared to medical professionals' assessments.

Methods: We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists created by physicians, ChatGPT-3, and ChatGPT-4, using clinical vignettes. Eighty-two clinical vignettes were used, comprising 52 complex case reports published by the authors from the department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in the top 10 differential diagnosis lists using the kappa coefficient.

Results: Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement.

Conclusions: ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis should be included in the differential diagnosis lists.
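For illustration, the sketch below shows how percent agreement and Cohen's kappa are typically computed for paired yes/no judgments of this kind. It is not the authors' code, and the example labels are hypothetical; the paper reports only the aggregate figures (236/246 agreement, kappa 0.86), not the underlying 2x2 contingency table.

```python
# Minimal sketch (not the authors' code): percent agreement and Cohen's kappa
# for two raters giving binary yes/no judgments. All counts below are
# hypothetical and chosen only to show the calculation.
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same set of items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement p_o: fraction of items where the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement p_e: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)


# Hypothetical example: physician vs. ChatGPT-4 judgments ("yes" = the final
# diagnosis appears in the top-10 differential diagnosis list).
physician = ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"]
chatgpt4  = ["yes", "yes", "no", "yes", "yes", "yes", "no", "yes"]

agreement = sum(p == c for p, c in zip(physician, chatgpt4)) / len(physician)
print(f"Percent agreement: {agreement:.3f}")
print(f"Cohen's kappa:     {cohen_kappa(physician, chatgpt4):.3f}")
```

Kappa discounts the agreement expected by chance, which is why the study reports it alongside the raw 95.9 % agreement rate; a kappa of 0.86 is conventionally read as very good (almost perfect) agreement.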

Source journal

Diagnosis (MEDICINE, GENERAL & INTERNAL)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 41
About the journal: Diagnosis focuses on how diagnosis can be advanced, how it is taught, and how and why it can fail, leading to diagnostic errors. The journal welcomes both fundamental and applied works, improvement initiatives, opinions, and debates to encourage new thinking on improving this critical aspect of healthcare quality.
Topics:
- Factors that promote diagnostic quality and safety
- Clinical reasoning
- Diagnostic errors in medicine
- The factors that contribute to diagnostic error: human factors, cognitive issues, and system-related breakdowns
- Improving the value of diagnosis: eliminating waste and unnecessary testing
- How culture and removing blame promote awareness of diagnostic errors
- Training and education related to clinical reasoning and diagnostic skills
- Advances in laboratory testing and imaging that improve diagnostic capability
- Local, national and international initiatives to reduce diagnostic error
Latest articles in this journal
- Lessons in clinical reasoning - pitfalls, myths, and pearls: a case of persistent dysphagia and patient partnership.
- Root cause analysis of cases involving diagnosis.
- A delayed diagnosis of hyperthyroidism in a patient with persistent vomiting in the presence of Chiari type 1 malformation.
- Systematic review and meta-analysis of observational studies evaluating glial fibrillary acidic protein (GFAP) and ubiquitin C-terminal hydrolase L1 (UCHL1) as blood biomarkers of mild acute traumatic brain injury (mTBI) or sport-related concussion (SRC) in adult subjects.
- Bridging the divide: addressing discrepancies between clinical guidelines, policy guidelines, and biomarker utilization.