Overconfident AI? Benchmarking LLM Self-Assessment in Clinical Scenarios
M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang
medRxiv, 2024-08-11. DOI: 10.1101/2024.08.11.24311810
Abstract
Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess how accurately they can judge their own responses.

Methods: We used 1,965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated.

Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs. 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. Most models demonstrated consistency across question versions.

Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation.

Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.
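To make the calibration analysis described in the Methods concrete, here is a minimal Python sketch (not the authors' code) of the core comparison: collect each model answer together with its self-reported confidence, then compare mean confidence on correct versus incorrect answers with a t-test. The `Response` type, the 0-100 confidence scale, and the toy data are illustrative assumptions, not the study's actual data or pipeline.

```python
# Minimal sketch of a confidence-calibration check: does a model report
# higher confidence when it is correct than when it is wrong?
# The Response records and toy data below are hypothetical placeholders.
from dataclasses import dataclass
from scipy import stats

@dataclass
class Response:
    correct: bool      # did the model pick the keyed answer?
    confidence: float  # self-reported confidence, assumed 0-100 scale

def calibration_gap(responses: list[Response]) -> None:
    """Compare mean self-reported confidence on correct vs. incorrect answers."""
    conf_correct = [r.confidence for r in responses if r.correct]
    conf_wrong = [r.confidence for r in responses if not r.correct]
    # Welch's t-test: are the two confidence distributions different?
    t, p = stats.ttest_ind(conf_correct, conf_wrong, equal_var=False)
    gap = (sum(conf_correct) / len(conf_correct)
           - sum(conf_wrong) / len(conf_wrong))
    print(f"accuracy: {len(conf_correct) / len(responses):.1%}")
    print(f"mean confidence gap (correct - incorrect): {gap:.1f} pts, "
          f"t={t:.2f}, p={p:.4f}")

# Toy data mirroring the abstract's finding: confidence stays high
# whether or not the answer is correct, so the gap is small.
toy = [Response(True, 80.0), Response(True, 75.0), Response(False, 78.0),
       Response(False, 82.0), Response(True, 79.0), Response(False, 77.0)]
calibration_gap(toy)
```

A well-calibrated model would show a large, significant gap (much higher confidence when correct); the abstract reports gaps of only 0.6% to 5.4%, which this kind of test would flag as weak or negligible separation.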