Overconfident AI? Benchmarking LLM Self-Assessment in Clinical Scenarios

M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang
{"title":"过于自信的人工智能?临床场景中的法律硕士自我评估基准","authors":"M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang","doi":"10.1101/2024.08.11.24311810","DOIUrl":null,"url":null,"abstract":"Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their responses. Methods: We used 1965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated. Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. Most models demonstrated consistency across question versions. Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation. Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.","PeriodicalId":18505,"journal":{"name":"medRxiv","volume":"2 12","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Overconfident AI? Benchmarking LLM Self-Assessment in Clinical Scenarios\",\"authors\":\"M. Omar, Benjamin S. Glicksberg, G. Nadkarni, E. Klang\",\"doi\":\"10.1101/2024.08.11.24311810\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their responses. Methods: We used 1965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated. Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. 
Most models demonstrated consistency across question versions. Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation. Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.\",\"PeriodicalId\":18505,\"journal\":{\"name\":\"medRxiv\",\"volume\":\"2 12\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.08.11.24311810\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.11.24311810","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their own responses.

Methods: We used 1,965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated.

Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared with lower-tier models (79.6% vs. 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. Most models demonstrated consistency across question versions.

Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation.

Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.
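To make the Methods concrete, the following is a minimal sketch, not the authors' code, of the per-model calibration comparison the abstract describes: mean stated confidence on correct versus incorrect answers, compared with a t-test. The file name llm_responses.csv and its columns (model, correct, confidence) are hypothetical placeholders for the study's actual data layout.

# Minimal sketch of the calibration analysis described in the abstract
# (assumed data layout; not the authors' code).
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical results file: one row per (model, question), with a boolean
# `correct` flag and the model's stated confidence on a 0-100 scale.
df = pd.read_csv("llm_responses.csv")

for model, grp in df.groupby("model"):
    conf_correct = grp.loc[grp["correct"], "confidence"]
    conf_incorrect = grp.loc[~grp["correct"], "confidence"]
    # Welch's t-test: is stated confidence higher when the answer is right?
    t_stat, p_value = ttest_ind(conf_correct, conf_incorrect, equal_var=False)
    gap = conf_correct.mean() - conf_incorrect.mean()
    print(f"{model}: mean confidence gap = {gap:.1f} points, p = {p_value:.3g}")

The accuracy comparison across models could be handled analogously with a chi-square test on counts of correct and incorrect answers (e.g., scipy.stats.chi2_contingency).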