Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning

Impact Factor: 50 · JCR Q1 (Biochemistry & Molecular Biology; Region 1, Medicine) · Nature Medicine · Publication date: 2025-04-23 · DOI: 10.1038/s41591-025-03726-3
Mickael Tordjman, Zelong Liu, Murat Yuce, Valentin Fauveau, Yunhao Mei, Jerome Hadjadj, Ian Bolger, Haidara Almansour, Carolyn Horst, Ashwin Singh Parihar, Amine Geahchan, Anis Meribout, Nader Yatim, Nicole Ng, Phillip Robson, Alexander Zhou, Sara Lewis, Mingqian Huang, Timothy Deyer, Bachir Taouli, Hao-Chih Lee, Zahi A. Fayad, Xueyan Mei

Abstract

DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) in performing four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, providing tumor classification according to RECIST 1.1 criteria and providing summaries of diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy of 0.57 versus 0.55; P = 0.76 and 0.74 versus 0.76; P = 0.06, using New England Journal of Medicine and Médicilline databases, respectively). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek were deemed more accurate than those provided by ChatGPT and Llama 3.1-405B (average Likert scores of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement. The open-source DeepSeek large language model showed variable performance relative to two leading models when benchmarked on four different medical tasks, with relatively strong reasoning capabilities but similar or weaker relative performance on other tasks, such as summarization of imaging reports.
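The accuracy comparisons reported above (e.g., 0.92 versus 0.95 on the USMLE with P = 0.04) are paired: each model answers the same set of questions, so the natural significance test is McNemar's test on the discordant pairs. The abstract does not state which test the authors used, so the sketch below, with illustrative counts, only shows how such a paired comparison can be computed with an exact binomial McNemar test:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant question pairs.

    b = questions model A answered correctly and model B incorrectly;
    c = questions model B answered correctly and model A incorrectly.
    Under the null hypothesis the discordant counts follow Binomial(b+c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p-value: double the smaller tail, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# Hypothetical discordant counts: model A uniquely right on 5 items,
# model B uniquely right on 15 items (concordant items do not matter).
p_value = mcnemar_exact_p(5, 15)  # ≈ 0.041, significant at the 0.05 level
```

Only the discordant counts enter the test; questions both models got right (or both got wrong) carry no information about which model is better.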

Source journal

Nature Medicine (Medicine: Biochemistry & Molecular Biology)
CiteScore: 100.90 · Self-citation rate: 0.70% · Articles per year: 525 · Time to review: 1 month
Journal description: Nature Medicine is a monthly journal publishing original peer-reviewed research in all areas of medicine. The publication focuses on originality, timeliness, interdisciplinary interest, and impact on improving human health. In addition to research articles, Nature Medicine also publishes commissioned content such as News, Reviews, and Perspectives. This content aims to provide context for the latest advances in translational and clinical research, reaching a wide audience of M.D. and Ph.D. readers. All editorial decisions for the journal are made by a team of full-time professional editors. Nature Medicine considers all types of clinical research, including case reports and small case series; clinical trials (phase 1, 2, 3 or 4); observational studies; meta-analyses; biomarker studies; and public and global health studies. Nature Medicine is also committed to facilitating communication between translational and clinical researchers, and therefore considers "hybrid" studies with preclinical and translational findings reported alongside data from clinical studies.