A systematic evaluation of the performance of GPT-4 and PaLM2 to diagnose comorbidities in MIMIC-IV patients

Peter Sarvari, Zaid Al-fagih, Abdullatif Ghuwel, Othman Al-fagih
{"title":"A systematic evaluation of the performance of GPT-4 and PaLM2 to diagnose comorbidities in MIMIC-IV patients","authors":"Peter Sarvari,&nbsp;Zaid Al-fagih,&nbsp;Abdullatif Ghuwel,&nbsp;Othman Al-fagih","doi":"10.1002/hcs2.79","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Background</h3>\n \n <p>Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we will refer to as ground truth diagnoses. We then designed carefully-written prompts to get patient diagnostic predictions from the LLMs and compared this to the ground truth diagnoses in a random sample of 1000 patients.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses.</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>The results suggest that artificial intelligence (AI) has the potential when working alongside clinicians to reduce cognitive errors which lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, a significant number of challenges in incorporating AI into health care exist, including ethical, liability and regulatory barriers.</p>\n </section>\n </div>","PeriodicalId":100601,"journal":{"name":"Health Care Science","volume":"3 1","pages":"3-18"},"PeriodicalIF":0.0000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/hcs2.79","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Health Care Science","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/hcs2.79","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background

Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies to evaluate the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.

Methods

To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as the ground truth diagnoses. We then designed carefully written prompts to obtain diagnostic predictions from the LLMs and compared these predictions to the ground truth diagnoses in a random sample of 1000 patients.
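
The paper's prompts and pipeline code are not reproduced in this abstract, so the following is only a minimal sketch of how structured EHR data might be turned into a diagnostic prompt, assuming the OpenAI Python SDK (>=1.0). The summarize_ehr helper, the field names, and the prompt wording are hypothetical illustrations, not the authors' actual prompt or the MIMIC-IV schema.

```python
# Hypothetical sketch: the study's actual prompts and pipeline are not published here.
from openai import OpenAI  # assumes the openai>=1.0 Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_ehr(record: dict) -> str:
    """Flatten selected EHR sections (labs, vitals, imaging, microbiology) into plain text.
    The section names are illustrative, not the MIMIC-IV schema."""
    parts = []
    for section in ("blood_tests", "vital_signs", "imaging_reports", "microbiology"):
        if record.get(section):
            parts.append(f"{section}: {record[section]}")
    return "\n".join(parts)


def predict_diagnoses(record: dict, model: str = "gpt-4") -> list[str]:
    """Ask the model for a list of likely diagnoses, one per line."""
    prompt = (
        "You are assisting a physician. Based on the following de-identified "
        "patient data, list the most likely diagnoses, one per line:\n\n"
        + summarize_ehr(record)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content or ""
    # Parse one diagnosis per non-empty line, stripping list bullets.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
```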

Results

Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%; PaLM2 achieved 84.7% on the same data set. Across these 1000 randomly selected EHRs, GPT-4 correctly identified 1116 unique diagnoses.
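
The abstract defines the hit rate as the proportion of ground truth diagnoses that the model predicted correctly. A minimal sketch of that calculation follows; the exact-string matching and the toy data are assumptions made for illustration, since the paper's actual procedure for matching free-text LLM output to the curated diagnoses is not described here.

```python
def diagnostic_hit_rate(ground_truth: dict[str, set[str]],
                        predictions: dict[str, set[str]]) -> float:
    """Proportion of ground truth diagnoses also predicted by the model,
    pooled over all patients. Exact string matching is an assumption; the
    study may have matched synonymous or closely related diagnoses."""
    total = 0
    hits = 0
    for patient_id, truths in ground_truth.items():
        preds = predictions.get(patient_id, set())
        total += len(truths)
        hits += len(truths & preds)
    return hits / total if total else 0.0


# Toy example with made-up data: 2 of 3 ground truth diagnoses are matched.
gt = {"p1": {"sepsis", "acute kidney injury"}, "p2": {"pneumonia"}}
pred = {"p1": {"sepsis"}, "p2": {"pneumonia", "copd"}}
print(f"{diagnostic_hit_rate(gt, pred):.3f}")  # 0.667
```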

Conclusion

The results suggest that artificial intelligence (AI), working alongside clinicians, has the potential to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, significant challenges remain in incorporating AI into health care, including ethical, liability and regulatory barriers.
