
DeepSeek-V2.5

DeepSeek

DeepSeek-V2.5 is an enhanced version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct, integrating general-purpose and coding capabilities. The model is better aligned with human preferences and has been improved in several areas, including writing and instruction following.

Key Specifications

Parameters
236.0B
Context
8.2K
Release Date
May 8, 2024
Average Score
71.1%

Timeline

Key dates in the model's history
Announcement
May 8, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
236.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$2.00
Output (per 1M tokens)
$2.00
Max Input Tokens
8.2K
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
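As a rough illustration of how the listed rates translate into per-request cost, below is a minimal sketch that calls the model through DeepSeek's OpenAI-compatible chat completions API and prices the reported token usage at the listed $2.00 per 1M tokens for both input and output. The base URL and the `deepseek-chat` model identifier follow DeepSeek's public API documentation and are assumptions here; they may not map exactly to this specific release.

```python
# Minimal sketch: call DeepSeek via its OpenAI-compatible API and estimate cost.
# Assumes the `openai` Python SDK and an API key in DEEPSEEK_API_KEY.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint (assumption)
)

response = client.chat.completions.create(
    model="deepseek-chat",  # served model name per DeepSeek docs (assumption)
    messages=[{"role": "user", "content": "Write a haiku about code review."}],
    max_tokens=256,
)

usage = response.usage
# Listed price: $2.00 per 1M tokens for both input and output.
cost = (usage.prompt_tokens + usage.completion_tokens) * 2.00 / 1_000_000
print(response.choices[0].message.content)
print(f"Estimated cost: ${cost:.6f}")
```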

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests of general knowledge and understanding
MMLU
Score (Self-reported)
80.4%

Programming

Programming skills tests
HumanEval
Pass@1: the probability that the model produces a fully correct solution on its first attempt, without needing multiple tries. The metric is especially important for code and mathematical tasks where only a fully correct answer counts, and for applications that require first-attempt accuracy. (Self-reported)
89.0%
SWE-Bench Verified
Evaluation: a measure of the quality of a model's answers. Two types are used: reference-based evaluation, for tasks with known answers, scored as correct/incorrect, as numerical points, or on a scale (for example, from 1 to 5); and human expert evaluation, for tasks requiring nuance or complex reasoning, where experts judge accuracy and other aspects of quality. Accuracy metrics include Accuracy (the percentage of correct answers) and F1-score (a weighted combination of precision and recall). Comparative evaluation ranks a model against other models, for example by having people choose between answers from different models, or through a rating system based on how well models perform against each other. (Self-reported)
16.8%

Mathematics

Mathematical problems and computations
GSM8k
Score (Self-reported)
95.1%
MATH
Solutions are graded with the same system applied in the FrontierMath and Olympiad-level GPQA evaluations, on a scale from 0 to 5: 5 for a complete solution with no errors; 4 for a partially correct solution with only minor errors; 3 for a solution showing progress but with more substantial errors; 2 for some progress, but far from a correct solution; 1 for minimal progress; 0 for no progress. (Self-reported)
74.7%

Other Tests

Specialized benchmarks
Aider
Score: the model is evaluated across specific categories, each yielding an overall score from 1 to 10. For each category the evaluation combines benchmark scores with qualitative analysis from our own testing. Key points: weighted benchmark scores, performance across capabilities, and a normalized score intended for comparison between models, on a 1-10 scale where 10 represents the strongest observed capabilities. Note that the score is meant for comparing performance between models, not as an absolute measure; a 10 in a category reflects the best performance achievable by current models. (Self-reported)
72.2%
AlignBench
Score: results are rated against weighted evaluation criteria (a toy sketch of the combination step follows this entry). Process: (1) define evaluation criteria appropriate to the task; (2) attach a rating scale to each criterion with descriptions for each level of performance, for example 1-3 weak, 4-7 adequate, 8-10 outperforms expectations; (3) rate the results on each criterion; (4) combine the criterion ratings into an overall score, for example as a weighted average; (5) compare the overall score against thresholds to judge overall performance. Example for mathematical problem solving: Correctness (1-10), Efficiency of method (1-10), and Quality of explanations (1-10), combined as (0.5 × Correctness) + (0.3 × Efficiency) + (0.2 × Explanations), interpreted as <5 weak, 5-7 adequate, 7-9 good, >9 excellent. Limitations: the criteria and weights are subjective, may not cover all aspects of performance, and cannot capture aspects that are difficult to measure. (Self-reported)
80.4%
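A toy sketch of the weighted-criteria combination described above, using the example weights from that description (0.5 correctness, 0.3 efficiency, 0.2 explanations). The criterion names, values, and interpretation bands are illustrative, not taken from AlignBench itself.

```python
# Toy weighted-criteria scorer matching the example rubric above.
WEIGHTS = {"correctness": 0.5, "efficiency": 0.3, "explanations": 0.2}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each on a 1-10 scale) into one weighted score."""
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)

def band(score: float) -> str:
    """Map the combined score to the interpretation bands from the description."""
    if score < 5:
        return "weak"
    if score < 7:
        return "adequate"
    if score <= 9:
        return "good"
    return "excellent"

example = {"correctness": 8, "efficiency": 6, "explanations": 7}
s = overall_score(example)  # 0.5*8 + 0.3*6 + 0.2*7 = 7.2
print(s, band(s))           # 7.2 good
```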
AlpacaEval 2.0
Score: this covers prompting an LLM to produce numerical evaluations of solutions. Two main approaches are used: (1) direct scoring, where the model outputs a value directly, for example rating an answer in a GPQA task on a scale from 1 to 5; (2) derived scoring, where the model performs intermediate reasoning steps that are then converted into a value, for example binary judgments (0 or 1), computations over several criteria, or defining evaluation criteria and then rating against them. Models can also make choices based on these evaluations, for example selecting among several options. Advantages: it allows comparisons without requiring the model to make final decisions itself, it yields more information about the model, and it can be richer than a simple binary judgment. Disadvantages: some models struggle to produce numerical scores, specific numerical values can be hard to interpret, and it is not always clear which scoring system suits a given task. (Self-reported)
50.5%
Arena Hard
Evaluation: after all tasks are completed, each answer is marked correct or incorrect. Accuracy is then computed for each of the 50 tasks and for each data split; the resulting values can fall above or below the reference values. To obtain a more reliable estimate of accuracy, bootstrapping is used: for each split, 1,000 resamples are drawn with replacement from the sample and accuracy is computed on each resample (a sketch follows this entry). A 95% confidence interval is then taken with the percentile method, using the 2.5th and 97.5th percentiles of the bootstrap distribution. For aggregate results, accuracy is first computed over all tasks for each resample, and the mean and 95% interval of those values are reported. (Self-reported)
76.2%
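The bootstrap procedure described above can be sketched as follows; the resample count of 1,000 and the percentile bounds mirror the description, but the data and implementation are illustrative rather than the exact evaluation code.

```python
# Illustrative bootstrap 95% confidence interval for accuracy, as described above.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(correct: np.ndarray, n_resamples: int = 1000):
    """correct: 0/1 array of per-task outcomes. Returns (mean, lo, hi)."""
    n = len(correct)
    resample_means = np.array([
        correct[rng.integers(0, n, size=n)].mean()  # resample with replacement
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(resample_means, [2.5, 97.5])  # 95% interval
    return correct.mean(), lo, hi

outcomes = (rng.random(50) < 0.762).astype(float)  # hypothetical 50-task run
print(bootstrap_ci(outcomes))
```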
BBH
Score (Self-reported)
84.3%
DS-Arena-Code
Evaluation: accuracy is measured by matching the model's answers against the correct answers, which come in two forms. Multiple-choice answers: the model's answer is compared with the correct option to check whether they match. Free-form answers requiring reasoning: an answer is accepted if it matches the correct answer, even when the solution process differs. For multiple-choice items the score is 1.0 if the model's answer matches the correct answer and 0.0 otherwise; for reasoning items the score is 1.0 if the answer is correct and 0.0 otherwise. The overall score is the average of all item scores. To grade multiple-choice items, the chosen option is extracted from the end of the model's response (for example, "answer: B"); for reasoning items, the final answer is extracted from the end of the response (a sketch of this matching follows the entry). If the model gives multiple answers or no answer can be extracted, the score is 0.0. (Self-reported)
63.1%
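A minimal sketch of the matching scheme described above: extract the final option from the end of the model's output and compare it with the reference, scoring 1.0 on a match and 0.0 otherwise (including when nothing can be extracted). The regex and helper names are illustrative, not the benchmark's actual code.

```python
# Illustrative exact-match scorer for multiple-choice outputs like "answer: B".
import re

def extract_choice(model_output: str) -> str | None:
    """Take the last 'answer: X' pattern in the output, if any."""
    matches = re.findall(r"answer:\s*([A-D])", model_output, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None

def score(model_output: str, reference: str) -> float:
    """1.0 if the extracted choice matches the reference, else 0.0."""
    choice = extract_choice(model_output)
    return 1.0 if choice == reference.upper() else 0.0

print(score("The area doubles, so the answer: b", "B"))  # 1.0
print(score("I am not sure.", "B"))                      # 0.0 (nothing extracted)
```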
DS-FIM-Eval
Score: a scoring system that accounts for the answer itself, the degree of prompting needed, and the interaction with the model while solving the problem. For each problem, we score the level of help the model needed to reach the correct answer (a small sketch follows this entry): 1.0 if the model immediately gives the correct answer without any hints or follow-up questions; 0.5 if the model reaches the correct answer only with help, such as hints about errors or requests to revise the solution, including cases where the model corrects itself in its answer; 0.0 if the model cannot reach the correct answer even with help. This rewards the ability to produce correct answers with minimal assistance, which matters for settings where the model cannot be coached toward the right answer. (Self-reported)
78.3%
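A tiny sketch of the help-level rubric above; the labels and averaging are illustrative.

```python
# Illustrative mapping from "how much help was needed" to the 1.0 / 0.5 / 0.0 rubric.
from enum import Enum

class Help(Enum):
    NONE = "correct without hints"
    HINTED = "correct only after hints or corrections"
    FAILED = "incorrect even with help"

SCORES = {Help.NONE: 1.0, Help.HINTED: 0.5, Help.FAILED: 0.0}

runs = [Help.NONE, Help.HINTED, Help.NONE, Help.FAILED]
print(sum(SCORES[r] for r in runs) / len(runs))  # 0.625
```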
HumanEval-Mul
Pass@1: measures the probability of solving a task on the first attempt. To estimate it, the model generates n sample answers for each task; each sample is judged correct or incorrect, and the probability that one randomly chosen sample out of the n is correct is estimated as c/n, where c is the number of correct samples. Pass@1 over the full task set is the average of these per-task values (an estimator sketch follows this entry). The metric is often used to evaluate models on programming and mathematical tasks; it shows how often the model produces a correct answer on the first attempt, with no opportunity to revise. (Self-reported)
73.8%
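The c/n estimate described above is the k = 1 case of the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k). A short sketch under that assumption; the sample counts are illustrative.

```python
# Sketch of the pass@k estimator; pass@1 reduces to c/n as described above.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled answers is correct),
    given c correct answers among n generated samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-task results: (samples generated, samples correct)
tasks = [(10, 7), (10, 4), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
print(pass_at_1)  # (0.7 + 0.4 + 1.0) / 3 = 0.70
```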
LiveCodeBench(01-09)
Score: a method for analyzing AI models' mathematical abilities through reasoning about possible errors in their own solutions. A GPT-4-based checker, referred to as "Score", performs this analysis without human involvement: it verifies its own solutions to mathematical tasks, finds logical errors, considers alternative approaches, and checks that the answer addresses everything required. On a mathematical task dataset, Score reaches 83.0% accuracy, significantly above GPT-4 (58.0%) and another baseline (75.0%). The procedure: (1) GPT-4 first solves the task; (2) Score reviews this solution; (3) Score identifies possible problems and fixes; (4) Score verifies the solution step by step and checks the final answer. Score stands out for its ability to find and correct errors that arise from typical mathematical LLM mistakes, and it clearly outperforms approaches that sample several solutions while remaining efficient. This suggests that an LLM can substantially improve its mathematical ability when given the means to reason about errors in its own solutions. (Self-reported)
41.8%
MT-Bench
Score (Self-reported)
90.2%

License & Metadata

License
deepseek
Announcement Date
May 8, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.