
DeepSeek-V3

DeepSeek

A powerful Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token. It features Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, and multi-token prediction training, and was pre-trained on 14.8 trillion tokens, with strong performance on logical reasoning, math, and coding tasks.
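To make the "37 billion activated per token" figure concrete, the sketch below shows a generic top-k expert-routing step for one MoE layer. It is an illustrative PyTorch example with hypothetical names and dimensions; it does not reproduce DeepSeek-V3's MLA, gating, or auxiliary-loss-free load-balancing details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(x, experts, router, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: [tokens, d_model]; experts: list of per-expert MLPs; router: gating Linear.
    Only k experts run per token, so the active parameters are a small
    fraction of the total -- the idea behind "37B of 671B active".
    """
    scores = router(x)                               # [tokens, n_experts]
    topk_scores, topk_idx = scores.topk(k, dim=-1)   # choose k experts per token
    weights = F.softmax(topk_scores, dim=-1)         # mixture weights over chosen experts
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e            # tokens whose slot-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: 8 experts, 2 active per token (numbers are illustrative only).
d_model, n_experts = 16, 8
experts = [nn.Sequential(nn.Linear(d_model, 32), nn.GELU(), nn.Linear(32, d_model))
           for _ in range(n_experts)]
router = nn.Linear(d_model, n_experts)
tokens = torch.randn(4, d_model)
print(moe_forward(tokens, experts, router).shape)    # torch.Size([4, 16])
```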

Key Specifications

Parameters
671.0B
Context
131.1K
Release Date
December 25, 2024
Average Score
67.2%

Timeline

Key dates in the model's history
Announcement
December 25, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
671.0B
Training Tokens
14.8T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.27
Output (per 1M tokens)
$1.10
Max Input Tokens
131.1K
Max Output Tokens
131.1K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
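As a quick illustration of the pricing above, the snippet below estimates the cost of a single request from the listed per-million-token rates (the token counts in the example are hypothetical):

```python
# Cost estimate from the listed per-1M-token prices (USD).
INPUT_PRICE_PER_M = 0.27   # input, per 1M tokens (from the table above)
OUTPUT_PRICE_PER_M = 1.10  # output, per 1M tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 20,000-token prompt with a 2,000-token completion.
print(f"${request_cost(20_000, 2_000):.4f}")  # $0.0076
```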

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Exact match (Self-reported)
88.5%

Programming

Programming skills tests
SWE-Bench Verified
Resolved (Self-reported)
42.0%

Reasoning

Logical reasoning and analysis
DROP
3-shot F1 (Self-reported)
91.6%
GPQA
Pass@1 (Self-reported)
59.1%

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy (Self-reported)
49.6%
Aider-Polyglot Edit
Accuracy (Self-reported)
79.7%
AIME 2024
Pass@1 (Self-reported)
39.2%
C-Eval
Exact match (Self-reported)
86.5%
CLUEWSC
Exact match (Self-reported)
90.9%
CNMO 2024
Pass@1 (Self-reported)
43.2%
CSimpleQA
Correct (Self-reported)
64.8%
FRAMES
Accuracy (Self-reported)
73.3%
HumanEval-Mul
Pass@1 (Self-reported)
82.6%
IFEval
Strict (Self-reported)
86.1%
LiveCodeBench
Pass@1 (Self-reported)
37.6%
LongBench v2
Accuracy (Self-reported)
48.7%
MATH-500
Exact match (Self-reported)
90.2%
MMLU-Pro
Exact match (Self-reported)
75.9%
MMLU-Redux
Exact match (Self-reported)
89.1%
SimpleQA
Correct (Self-reported)
24.9%
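Several of the scores above are reported as Pass@1 or exact match. For reference, here is a minimal sketch of how such metrics are commonly computed (the unbiased pass@k estimator popularized by the HumanEval paper, plus a lightly normalized exact-match check). It is illustrative only and not the exact evaluation harness behind these numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    For k=1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trivial whitespace/case normalization."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(reference)

print(pass_at_k(100, 75, 1))            # 0.75 -> reported as 75.0%
print(exact_match(" Paris ", "paris"))  # True
```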

License & Metadata

License
MIT + Model License (commercial use allowed)
Announcement Date
December 25, 2024
Last Updated
July 19, 2025
