Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study

IF 9.3 | Zone 1 (Medicine) | JCR Q1 (Critical Care Medicine) | Critical Care | Pub Date: 2025-02-10 | DOI: 10.1186/s13054-025-05302-0
Jessica D. Workum, Bas W. S. Volkers, Davy van de Sande, Sumesh Arora, Marco Goeijenbier, Diederik Gommers, Michel E. van Genderen

Abstract

Large language models (LLMs) show increasing potential for use in healthcare, both for administrative support and for clinical decision making. However, reports on their performance in critical care medicine are lacking.

This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407 and Llama 3.1 70B) on 1181 multiple choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of critical care questions at European Diploma in Intensive Care examination level. Their performance was compared to random guessing and to that of 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.

GPT-4o achieved the highest accuracy at 93.3%, followed by Llama 3.1 70B (87.5%), Mistral Large 2407 (87.9%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%). Random guessing yielded 41.5% (p < 0.001). On the practice test, all models scored higher than the human physicians: 89.0%, 80.9%, 84.4%, 80.3%, and 66.5%, respectively, compared to 42.7% for random guessing (p < 0.001) and 61.9% for the human physicians. However, in contrast to the other evaluated LLMs (p < 0.001), GPT-3.5-turbo did not significantly outperform the physicians (p = 0.196). Despite high overall consistency, all models gave consistently incorrect answers to some questions. The most expensive model was GPT-4o, costing over 25 times more than the least expensive model, GPT-4o-mini.

LLMs exhibit exceptional accuracy and consistency, with four significantly outperforming human physicians on a European-level practice exam. GPT-4o led in performance but raised concerns about energy consumption. Despite their potential in critical care, all models produced consistently incorrect answers, highlighting the need for more thorough and ongoing evaluations to guide responsible implementation in clinical settings.
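
The metrics named in the abstract (accuracy, consistency, consistently incorrect answers, and a random-guessing baseline) can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not define how each metric was computed, so the definitions below are assumptions, and the `questions` data and all function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical data (not from the study): each question has an answer key,
# a number of answer options, and one model's answers over repeated runs.
questions = [
    {"key": "A", "n_options": 4, "runs": ["A", "A", "A", "A", "A"]},
    {"key": "C", "n_options": 5, "runs": ["C", "C", "B", "C", "C"]},
    {"key": "B", "n_options": 2, "runs": ["A", "A", "A", "A", "A"]},
    {"key": "D", "n_options": 4, "runs": ["D", "D", "D", "D", "D"]},
]


def accuracy(qs):
    """Fraction of all (question, run) pairs answered correctly."""
    correct = sum(r == q["key"] for q in qs for r in q["runs"])
    total = sum(len(q["runs"]) for q in qs)
    return correct / total


def consistency(qs):
    """Assumed definition: share of questions where every run agrees."""
    return sum(len(set(q["runs"])) == 1 for q in qs) / len(qs)


def consistently_incorrect(qs):
    """Indices of questions answered identically on every run, but wrongly."""
    return [
        i for i, q in enumerate(qs)
        if len(set(q["runs"])) == 1 and q["runs"][0] != q["key"]
    ]


def random_guess_baseline(qs):
    """Expected accuracy of uniform guessing: the mean of 1 / n_options."""
    return float(sum(Fraction(1, q["n_options"]) for q in qs) / len(qs))


print("accuracy:", accuracy(questions))
print("consistency:", consistency(questions))
print("consistently incorrect:", consistently_incorrect(questions))
print("random baseline:", random_guess_baseline(questions))
```

In this toy data the third question is answered the same way on every run and is always wrong, which is the pattern the abstract flags: a model can be highly consistent overall while still being reliably wrong on particular questions.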
Source journal
Critical Care (Medicine: Critical Care Medicine)
CiteScore: 20.60
Self-citation rate: 3.30%
Articles published: 348
Review time: 1.5 months
Journal description: Critical Care is an international, peer-reviewed medical journal that aims to improve the care of critically ill patients. To that end, it gathers, exchanges, disseminates, and endorses evidence-based information relevant to intensivists, and seeks to provide thorough and inclusive coverage of the intensive care field.