
DeepSeek-R1

DeepSeek

DeepSeek-R1 is a first-generation reasoning model built on DeepSeek-V3 (671 billion total parameters, 37 billion activated per token). It uses large-scale reinforcement learning (RL) to strengthen chain-of-thought reasoning, and delivers strong results on mathematics, coding, and multi-step reasoning tasks.
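For illustration, a minimal sketch of calling the model through an OpenAI-compatible chat client. The endpoint URL, the "deepseek-reasoner" model identifier, and the reasoning_content field are assumptions for this example and are not taken from this page; check the provider's API documentation.

```python
# Hypothetical sketch: querying DeepSeek-R1 via an OpenAI-compatible API.
# The base_url, model id, and reasoning_content field are assumptions,
# not specified on this page -- consult the provider's documentation.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed identifier for DeepSeek-R1
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

message = response.choices[0].message
# Reasoning models may expose the chain of thought separately from the answer.
print(getattr(message, "reasoning_content", None))
print(message.content)
```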

Key Specifications

Parameters
671.0B
Context
65.5K
Release Date
January 20, 2025
Average Score
74.1%

Timeline

Key dates in the model's history
Announcement
January 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
671.0B
Training Tokens
14.8T tokens
Knowledge Cutoff
-
Family
-
Fine-tuned from
deepseek-v3
Capabilities
Multimodal · ZeroEval

Pricing & Availability

Input (per 1M tokens)
$3.00
Output (per 1M tokens)
$6.00
Max Input Tokens
65.5K
Max Output Tokens
65.5K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
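As a quick illustration of the listed rates, a sketch that estimates the cost of a single request from its token counts; the token numbers in the example are invented, and only the $3.00 and $6.00 per-million prices come from the table above.

```python
# Cost estimate from the per-million-token prices listed above.
INPUT_PRICE_PER_M = 3.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 6.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 2,000-token prompt with an 8,000-token (reasoning-heavy) response.
print(f"${request_cost(2_000, 8_000):.4f}")  # $0.0540
```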

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Pass@1: the share of questions the model answers correctly on its first attempt, without generating several candidates and selecting the best one. Unlike Pass@k (k > 1), it allows no retries, which makes it a stricter measure and closer to real-world use, where the model typically gets a single chance to answer. Self-reported
90.8%
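A minimal sketch of how a Pass@1 score such as the one above is computed: one answer per question, graded right or wrong, then averaged. The string comparison here stands in for whatever grading a real harness uses.

```python
# Pass@1: fraction of questions answered correctly on the single first attempt.
def pass_at_1(first_answers: list[str], references: list[str]) -> float:
    assert len(first_answers) == len(references)
    correct = sum(a.strip() == r.strip() for a, r in zip(first_answers, references))
    return correct / len(references)

# Toy example: 3 of 4 first attempts match the reference answers -> 0.75.
print(pass_at_1(["B", "A", "D", "C"], ["B", "A", "D", "A"]))  # 0.75
```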

Programming

Programming skills tests
SWE-Bench Verified
# (or "solution") when query on several which information for decision-making solutions or obtaining final output. can include: - **various facts**: key facts for or statements. - **limitations**: Definition key limitations, which can on solution. - **Definition **: at which solution will **thinking**: process in logical stages. When model with query, or solutions, where no answer, she/it can apply in order to to **Examples :** - "for" and "against" for decision-making solutions. - Analysis tasks with all her/its limitations. - steps for usually is applied in situations, where is several possible or approaches, and model should their for obtaining final outputSelf-reported
49.2%

Reasoning

Logical reasoning and analysis
DROP
3-shot F1: token-level F1 overlap between the model's answer and the reference answer, with three worked examples included in the prompt. F1 balances precision and recall, so partially correct answers receive partial credit. Self-reported
92.2%
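A simplified sketch of the token-level F1 used by reading-comprehension benchmarks such as DROP; real harnesses add answer normalization (articles, punctuation, number handling) that this version omits.

```python
# Simplified token-level F1 between a predicted and a reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("four touchdowns", "four touchdowns in total"))  # ~0.67, partial credit
```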
GPQA
Pass@1 (Diamond subset): the share of questions from the hardest GPQA Diamond split answered correctly on the first attempt. Because no additional attempts are allowed, the score reflects whether the model actually solves the question rather than searching over candidate answers. Self-reported
71.5%
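Since the description above contrasts Pass@1 with Pass@k, here is the widely used unbiased pass@k estimator from the code-generation evaluation literature, shown as a sketch: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples is correct. Pass@1 is the k = 1 case and reduces to c / n.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = samples generated, c = samples that were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # any k-sample subset must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25, i.e. plain first-attempt accuracy
print(pass_at_k(n=16, c=4, k=4))  # ~0.73: any of 4 draws may be correct
```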

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy: the share of multi-language code-editing exercises the model completes correctly. Self-reported
53.3%
AIME 2024
Pass@1: the probability that the model produces a correct solution on its first attempt, without generating several solutions and choosing among them. Self-reported
79.8%
AlpacaEval 2.0
LC win rate: length-controlled win rate, i.e. how often an LLM judge prefers this model's responses over a fixed reference model's, with a statistical correction that removes the advantage of longer answers. Self-reported
87.6%
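A toy sketch of a head-to-head win rate; the actual AlpacaEval 2.0 length-controlled score additionally fits a regression to remove the judge's bias toward longer answers, which this simplified version does not attempt.

```python
# Simplified (not length-controlled) win rate against a reference model.
def win_rate(judgments: list[str]) -> float:
    """judgments: 'win', 'loss', or 'tie' per prompt; ties count as half a win."""
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return 100 * score / len(judgments)

print(win_rate(["win", "win", "tie", "loss"]))  # 62.5
```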
ARC-AGI v2
Accuracy · Verified
1.3%
Arena Hard
Win rate vs. GPT-4-1106 baseline: responses are compared against the GPT-4-1106 baseline model by an LLM judge. Self-reported
92.3%
C-Eval
Exact match: the answer counts as correct only if it matches the reference answer exactly after basic normalization. Self-reported
91.8%
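A sketch of exact-match scoring with light normalization; actual harnesses differ in how aggressively they normalize case, whitespace, and equivalent numeric forms.

```python
# Exact match after light normalization (lowercase, collapsed whitespace).
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match("  B  ", "b"))       # True
print(exact_match("x = 9", "x = 12"))  # False
```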
CLUEWSC
Exact match · Self-reported
92.8%
CNMO 2024
Pass@1: the share of problems solved correctly in a single attempt, with no resampling. Self-reported
78.8%
CSimpleQA
Correct · Self-reported
63.7%
FRAMES
Accuracy · Self-reported
82.5%
IFEval
Prompt-level strict accuracy: the share of prompts for which the model satisfies every verifiable instruction (output format, length limits, required keywords, and so on) exactly as specified. Self-reported
83.3%
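A toy sketch of IFEval-style checking: each verifiable instruction has a programmatic checker, and under strict prompt-level scoring a prompt passes only if every checker passes. The specific instructions shown are illustrative, not taken from the benchmark.

```python
# Toy IFEval-style check: a prompt passes only if every verifiable
# instruction attached to it is satisfied by the response.
from typing import Callable

checkers: dict[str, Callable[[str], bool]] = {
    "all_uppercase": lambda r: r == r.upper(),
    "max_50_words": lambda r: len(r.split()) <= 50,
    "ends_with_period": lambda r: r.rstrip().endswith("."),
}

def prompt_passes(response: str, instructions: list[str]) -> bool:
    return all(checkers[name](response) for name in instructions)

print(prompt_passes("THIS IS SHORT.", ["all_uppercase", "max_50_words"]))  # True
```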
LiveCodeBench
Pass@1: the share of programming problems for which the model's first generated solution passes all test cases. Self-reported
65.9%
MATH-500
Pass@1: the share of problems solved correctly on the first attempt; no multiple attempts or best-of selection are allowed. Self-reported
97.3%
MMLU-Pro
Exact match · Self-reported
84.0%
MMLU-Redux
Exact match · Self-reported
92.9%
SimpleQA
Correct · Self-reported
30.1%

License & Metadata

License
MIT
Announcement Date
January 20, 2025
Last Updated
July 19, 2025
