Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

Impact Factor: 50 · JCR Q1 (Biochemistry & Molecular Biology; Region 1, Medicine) · Nature Medicine · Publication date: 2025-04-23 · DOI: 10.1038/s41591-025-03726-3
Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jerome Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis Meribout, Nader Yatim, Nicole Ng, Phillip Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, Bachir Taouli, Hao-Chih Lee, Zahi A. Fayad, Xueyan Mei

Abstract

DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 versus 0.55; P = 0.76 and 0.74 versus 0.76; P = 0.06, using New England Journal of Medicine and Médicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert scores of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement. The open-source DeepSeek large language model showed variable performance relative to two leading models when benchmarked on four different medical tasks, with relatively strong reasoning capabilities but similar or weaker relative performance on other tasks, such as summarization of imaging reports.
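The accuracy comparisons reported above (e.g., 0.92 versus 0.95 on the USMLE with P = 0.04) are paired: each model answers the same set of questions, so the natural significance test is McNemar's test on the discordant pairs. The abstract does not state which test the authors used, so the sketch below, with illustrative counts, only shows how such a paired comparison can be computed with an exact binomial McNemar test:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant question pairs.

    b = questions model A answered correctly and model B incorrectly;
    c = questions model B answered correctly and model A incorrectly.
    Under the null hypothesis the discordant counts follow Binomial(b+c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p-value: double the smaller tail, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical discordant counts: model A uniquely right on 5 items,
# model B uniquely right on 15 items (concordant items do not matter).
p_value = mcnemar_exact_p(5, 15)  # ≈ 0.041, significant at the 0.05 level
```

Only the discordant counts enter the test; questions both models got right (or both got wrong) carry no information about which model is better.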

Source journal

Nature Medicine (Medicine: Biochemistry & Molecular Biology)
CiteScore: 100.90 · Self-citation rate: 0.70% · Articles per year: 525 · Time to review: 1 month
Journal description: Nature Medicine is a monthly journal publishing original peer-reviewed research in all areas of medicine. The publication focuses on originality, timeliness, interdisciplinary interest, and impact on improving human health. In addition to research articles, Nature Medicine also publishes commissioned content such as News, Reviews, and Perspectives. This content aims to provide context for the latest advances in translational and clinical research, reaching a wide audience of M.D. and Ph.D. readers. All editorial decisions for the journal are made by a team of full-time professional editors. Nature Medicine considers all types of clinical research, including case reports and small case series; clinical trials (phase 1, 2, 3 or 4); observational studies; meta-analyses; biomarker studies; and public and global health studies. Nature Medicine is also committed to facilitating communication between translational and clinical researchers, and therefore considers "hybrid" studies with preclinical and translational findings reported alongside data from clinical studies.