
DeepSeek-V3

DeepSeek

A powerful Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token. It features Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, and multi-token prediction training, and was pre-trained on 14.8 trillion tokens, with strong performance on logical reasoning, math, and coding tasks.
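To make the "37 billion activated per token" figure concrete, the sketch below shows a generic top-k expert-routing step for one MoE layer. It is an illustrative PyTorch example with hypothetical names and dimensions; it does not reproduce DeepSeek-V3's MLA, gating, or auxiliary-loss-free load-balancing details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, experts, router, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: [tokens, d_model]; experts: list of per-expert MLPs; router: gating Linear.
    Only k experts run per token, so the active parameters are a small
    fraction of the total -- the idea behind "37B of 671B active".
    """
    scores = router(x)                               # [tokens, n_experts]
    topk_scores, topk_idx = scores.topk(k, dim=-1)   # choose k experts per token
    weights = F.softmax(topk_scores, dim=-1)         # mixture weights over chosen experts
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e            # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: 8 experts, 2 active per token (numbers are illustrative only).
d_model, n_experts = 16, 8
experts = [nn.Sequential(nn.Linear(d_model, 32), nn.GELU(), nn.Linear(32, d_model))
           for _ in range(n_experts)]
router = nn.Linear(d_model, n_experts)
tokens = torch.randn(4, d_model)
print(moe_forward(tokens, experts, router).shape)    # torch.Size([4, 16])
```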

Key Specifications

Parameters
671.0B
Context
131.1K
Release Date
December 25, 2024
Average Score
67.2%

Timeline

Key dates in the model's history
Announcement
December 25, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
671.0B
Training Tokens
14.8T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.27
Output (per 1M tokens)
$1.10
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
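As a quick illustration of the pricing above, the snippet below estimates the cost of a single request from the listed per-million-token rates (the token counts in the example are hypothetical):

```python
# Cost estimate from the listed per-1M-token prices (USD).
INPUT_PRICE_PER_M = 0.27   # input, per 1M tokens (from the table above)
OUTPUT_PRICE_PER_M = 1.10  # output, per 1M tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 20,000-token prompt with a 2,000-token completion.
print(f"${request_cost(20_000, 2_000):.4f}")  # $0.0076
```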

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Exact match (Self-reported)
88.5%

Programming

Programming skills tests
SWE-Bench Verified
Resolved (Self-reported)
42.0%

Reasoning

Logical reasoning and analysis
DROP
3-shot F1 (Self-reported)
91.6%
GPQA
Pass@1 (Self-reported)
59.1%

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy (Self-reported)
49.6%
Aider-Polyglot Edit
Accuracy (Self-reported)
79.7%
AIME 2024
Pass@1 (Self-reported)
39.2%
C-Eval
Exact match (Self-reported)
86.5%
CLUEWSC
Exact match (Self-reported)
90.9%
CNMO 2024
Pass@1 (Self-reported)
43.2%
CSimpleQA
Correct (Self-reported)
64.8%
FRAMES
Accuracy (Self-reported)
73.3%
HumanEval-Mul
Pass@1 (Self-reported)
82.6%
IFEval
Strict (Self-reported)
86.1%
LiveCodeBench
Pass@1 (Self-reported)
37.6%
LongBench v2
Accuracy (Self-reported)
48.7%
MATH-500
Exact match (Self-reported)
90.2%
MMLU-Pro
Exact match (Self-reported)
75.9%
MMLU-Redux
Exact match (Self-reported)
89.1%
SimpleQA
Correct (Self-reported)
24.9%
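Several of the scores above are reported as Pass@1 or exact match. For reference, here is a minimal sketch of how such metrics are commonly computed (the unbiased pass@k estimator popularized by the HumanEval paper, plus a lightly normalized exact-match check). It is illustrative only and not the exact evaluation harness behind these numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    For k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trivial whitespace/case normalization."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(reference)

print(pass_at_k(100, 75, 1))            # 0.75 -> reported as 75.0%
print(exact_match(" Paris ", "paris"))  # True
```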

License & Metadata

License
MIT + Model License (commercial use allowed)
Announcement Date
December 25, 2024
Last Updated
July 19, 2025
