
DeepSeek-R1

DeepSeek

DeepSeek-R1 is a first-generation reasoning model built on DeepSeek-V3 (671 billion total parameters, 37 billion activated per token). It uses large-scale reinforcement learning (RL) to strengthen chain-of-thought reasoning, and delivers strong results on mathematics, coding, and multi-step reasoning tasks.
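For illustration, a minimal sketch of calling the model through an OpenAI-compatible chat client. The endpoint URL, the "deepseek-reasoner" model identifier, and the reasoning_content field are assumptions for this example and are not taken from this page; check the provider's API documentation.

```python
# Hypothetical sketch: querying DeepSeek-R1 via an OpenAI-compatible API.
# The base_url, model id, and reasoning_content field are assumptions,
# not specified on this page -- consult the provider's documentation.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed identifier for DeepSeek-R1
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

message = response.choices[0].message
# Reasoning models may expose the chain of thought separately from the answer.
print(getattr(message, "reasoning_content", None))
print(message.content)
```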

Key Specifications

Parameters
671.0B
Context
65.5K
Release Date
January 20, 2025
Average Score
74.1%

Timeline

Key dates in the model's history
Announcement
January 20, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
671.0B
Training Tokens
14.8T tokens
Knowledge Cutoff
-
Family
-
Fine-tuned from
deepseek-v3
Capabilities
Multimodal · ZeroEval

Pricing & Availability

Input (per 1M tokens)
$3.00
Output (per 1M tokens)
$6.00
Max Input Tokens
65.5K
Max Output Tokens
65.5K
Supported Features
Function Calling · Structured Output · Code Execution · Web Search · Batch Inference · Fine-tuning
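As a quick illustration of the listed rates, a sketch that estimates the cost of a single request from its token counts; the token numbers in the example are invented, and only the $3.00 and $6.00 per-million prices come from the table above.

```python
# Cost estimate from the per-million-token prices listed above.
INPUT_PRICE_PER_M = 3.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 6.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 2,000-token prompt with an 8,000-token (reasoning-heavy) response.
print(f"${request_cost(2_000, 8_000):.4f}")  # $0.0540
```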

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Pass@1: the share of questions the model answers correctly on its first attempt, without generating several candidates and selecting the best one. Unlike Pass@k (k > 1), it allows no retries, which makes it a stricter measure and closer to real-world use, where the model typically gets a single chance to answer. Self-reported
90.8%
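A minimal sketch of how a Pass@1 score such as the one above is computed: one answer per question, graded right or wrong, then averaged. The string comparison here stands in for whatever grading a real harness uses.

```python
# Pass@1: fraction of questions answered correctly on the single first attempt.
def pass_at_1(first_answers: list[str], references: list[str]) -> float:
    assert len(first_answers) == len(references)
    correct = sum(a.strip() == r.strip() for a, r in zip(first_answers, references))
    return correct / len(references)

# Toy example: 3 of 4 first attempts match the reference answers -> 0.75.
print(pass_at_1(["B", "A", "D", "C"], ["B", "A", "D", "A"]))  # 0.75
```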

Programming

Programming skills tests
SWE-Bench Verified
# (or "solution") when query on several which information for decision-making solutions or obtaining final output. can include: - **various facts**: key facts for or statements. - **limitations**: Definition key limitations, which can on solution. - **Definition **: at which solution will **thinking**: process in logical stages. When model with query, or solutions, where no answer, she/it can apply in order to to **Examples :** - "for" and "against" for decision-making solutions. - Analysis tasks with all her/its limitations. - steps for usually is applied in situations, where is several possible or approaches, and model should their for obtaining final outputSelf-reported
49.2%

Reasoning

Logical reasoning and analysis
DROP
3-shot F1: token-level F1 overlap between the model's answer and the reference answer, with three worked examples included in the prompt. F1 balances precision and recall, so partially correct answers receive partial credit. Self-reported
92.2%
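A simplified sketch of the token-level F1 used by reading-comprehension benchmarks such as DROP; real harnesses add answer normalization (articles, punctuation, number handling) that this version omits.

```python
# Simplified token-level F1 between a predicted and a reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("four touchdowns", "four touchdowns in total"))  # ~0.67, partial credit
```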
GPQA
Pass@1 (Diamond subset): the share of questions from the hardest GPQA Diamond split answered correctly on the first attempt. Because no additional attempts are allowed, the score reflects whether the model actually solves the question rather than searching over candidate answers. Self-reported
71.5%
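Since the description above contrasts Pass@1 with Pass@k, here is the widely used unbiased pass@k estimator from the code-generation evaluation literature, shown as a sketch: generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples is correct. Pass@1 is the k = 1 case and reduces to c / n.

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
# where n = samples generated, c = samples that were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # any k-sample subset must contain a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25, i.e. plain first-attempt accuracy
print(pass_at_k(n=16, c=4, k=4))  # ~0.73: any of 4 draws may be correct
```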

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy: the share of multi-language code-editing exercises the model completes correctly. Self-reported
53.3%
AIME 2024
Pass@1: the probability that the model produces a correct solution on its first attempt, without generating several solutions and choosing among them. Self-reported
79.8%
AlpacaEval 2.0
LC win rate: length-controlled win rate, i.e. how often an LLM judge prefers this model's responses over a fixed reference model's, with a statistical correction that removes the advantage of longer answers. Self-reported
87.6%
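A toy sketch of a head-to-head win rate; the actual AlpacaEval 2.0 length-controlled score additionally fits a regression to remove the judge's bias toward longer answers, which this simplified version does not attempt.

```python
# Simplified (not length-controlled) win rate against a reference model.
def win_rate(judgments: list[str]) -> float:
    """judgments: 'win', 'loss', or 'tie' per prompt; ties count as half a win."""
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return 100 * score / len(judgments)

print(win_rate(["win", "win", "tie", "loss"]))  # 62.5
```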
ARC-AGI v2
Accuracy · Verified
1.3%
Arena Hard
Win rate vs. GPT-4-1106 baseline: responses are compared against the GPT-4-1106 baseline model by an LLM judge. Self-reported
92.3%
C-Eval
Exact match: the answer counts as correct only if it matches the reference answer exactly after basic normalization. Self-reported
91.8%
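A sketch of exact-match scoring with light normalization; actual harnesses differ in how aggressively they normalize case, whitespace, and equivalent numeric forms.

```python
# Exact match after light normalization (lowercase, collapsed whitespace).
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

print(exact_match("  B  ", "b"))       # True
print(exact_match("x = 9", "x = 12"))  # False
```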
CLUEWSC
Exact match · Self-reported
92.8%
CNMO 2024
Pass@1: the share of problems solved correctly in a single attempt, with no resampling. Self-reported
78.8%
CSimpleQA
Correct · Self-reported
63.7%
FRAMES
Accuracy · Self-reported
82.5%
IFEval
Prompt-level strict accuracy: the share of prompts for which the model satisfies every verifiable instruction (output format, length limits, required keywords, and so on) exactly as specified. Self-reported
83.3%
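A toy sketch of IFEval-style checking: each verifiable instruction has a programmatic checker, and under strict prompt-level scoring a prompt passes only if every checker passes. The specific instructions shown are illustrative, not taken from the benchmark.

```python
# Toy IFEval-style check: a prompt passes only if every verifiable
# instruction attached to it is satisfied by the response.
from typing import Callable

checkers: dict[str, Callable[[str], bool]] = {
    "all_uppercase": lambda r: r == r.upper(),
    "max_50_words": lambda r: len(r.split()) <= 50,
    "ends_with_period": lambda r: r.rstrip().endswith("."),
}

def prompt_passes(response: str, instructions: list[str]) -> bool:
    return all(checkers[name](response) for name in instructions)

print(prompt_passes("THIS IS SHORT.", ["all_uppercase", "max_50_words"]))  # True
```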
LiveCodeBench
Pass@1: the share of programming problems for which the model's first generated solution passes all test cases. Self-reported
65.9%
MATH-500
Pass@1: the share of problems solved correctly on the first attempt; no multiple attempts or best-of selection are allowed. Self-reported
97.3%
MMLU-Pro
Exact match · Self-reported
84.0%
MMLU-Redux
Exact match · Self-reported
92.9%
SimpleQA
Correct · Self-reported
30.1%

License & Metadata

License
MIT
Announcement Date
January 20, 2025
Last Updated
July 19, 2025
