
DeepSeek-V2.5

DeepSeek

DeepSeek-V2.5 is an enhanced version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct, integrating general-purpose and coding capabilities. The model is better aligned with human preferences and has been improved in several areas, including writing and instruction following.

Key Specifications

Parameters
236.0B
Context
8.2K
Release Date
May 8, 2024
Average Score
71.1%

Timeline

Key dates in the model's history
Announcement
May 8, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
236.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$2.00
Output (per 1M tokens)
$2.00
Max Input Tokens
8.2K
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
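As a rough illustration of how the listed rates translate into per-request cost, below is a minimal sketch that calls the model through DeepSeek's OpenAI-compatible chat completions API and prices the reported token usage at the listed $2.00 per 1M tokens for both input and output. The base URL and the `deepseek-chat` model identifier follow DeepSeek's public API documentation and are assumptions here; they may not map exactly to this specific release.

```python
# Minimal sketch: call DeepSeek via its OpenAI-compatible API and estimate cost.
# Assumes the `openai` Python SDK and an API key in DEEPSEEK_API_KEY.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint (assumption)
)

response = client.chat.completions.create(
    model="deepseek-chat",  # served model name per DeepSeek docs (assumption)
    messages=[{"role": "user", "content": "Write a haiku about code review."}],
    max_tokens=256,
)

usage = response.usage
# Listed price: $2.00 per 1M tokens for both input and output.
cost = (usage.prompt_tokens + usage.completion_tokens) * 2.00 / 1_000_000
print(response.choices[0].message.content)
print(f"Estimated cost: ${cost:.6f}")
```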

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests of general knowledge and understanding
MMLU
Score (Self-reported)
80.4%

Programming

Programming skills tests
HumanEval
Pass@1: the probability that the model produces a fully correct solution on its first attempt, without needing multiple tries. The metric is especially important for code and mathematical tasks where only a fully correct answer counts, and for applications that require first-attempt accuracy. (Self-reported)
89.0%
SWE-Bench Verified
Evaluation: a measure of the quality of a model's answers. Two types are used: reference-based evaluation, for tasks with known answers, scored as correct/incorrect, as numerical points, or on a scale (for example, from 1 to 5); and human expert evaluation, for tasks requiring nuance or complex reasoning, where experts judge accuracy and other aspects of quality. Accuracy metrics include Accuracy (the percentage of correct answers) and F1-score (a weighted combination of precision and recall). Comparative evaluation ranks a model against other models, for example by having people choose between answers from different models, or through a rating system based on how well models perform against each other. (Self-reported)
16.8%

Mathematics

Mathematical problems and computations
GSM8k
Score (Self-reported)
95.1%
MATH
Solutions are graded with the same system applied in the FrontierMath and Olympiad-level GPQA evaluations, on a scale from 0 to 5: 5 for a complete solution with no errors; 4 for a partially correct solution with only minor errors; 3 for a solution showing progress but with more substantial errors; 2 for some progress, but far from a correct solution; 1 for minimal progress; 0 for no progress. (Self-reported)
74.7%

Other Tests

Specialized benchmarks
Aider
Score: the model is evaluated across specific categories, each yielding an overall score from 1 to 10. For each category the evaluation combines benchmark scores with qualitative analysis from our own testing. Key points: weighted benchmark scores, performance across capabilities, and a normalized score intended for comparison between models, on a 1-10 scale where 10 represents the strongest observed capabilities. Note that the score is meant for comparing performance between models, not as an absolute measure; a 10 in a category reflects the best performance achievable by current models. (Self-reported)
72.2%
AlignBench
Score: results are rated against weighted evaluation criteria (a toy sketch of the combination step follows this entry). Process: (1) define evaluation criteria appropriate to the task; (2) attach a rating scale to each criterion with descriptions for each level of performance, for example 1-3 weak, 4-7 adequate, 8-10 outperforms expectations; (3) rate the results on each criterion; (4) combine the criterion ratings into an overall score, for example as a weighted average; (5) compare the overall score against thresholds to judge overall performance. Example for mathematical problem solving: Correctness (1-10), Efficiency of method (1-10), and Quality of explanations (1-10), combined as (0.5 × Correctness) + (0.3 × Efficiency) + (0.2 × Explanations), interpreted as <5 weak, 5-7 adequate, 7-9 good, >9 excellent. Limitations: the criteria and weights are subjective, may not cover all aspects of performance, and cannot capture aspects that are difficult to measure. (Self-reported)
80.4%
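A toy sketch of the weighted-criteria combination described above, using the example weights from that description (0.5 correctness, 0.3 efficiency, 0.2 explanations). The criterion names, values, and interpretation bands are illustrative, not taken from AlignBench itself.

```python
# Toy weighted-criteria scorer matching the example rubric above.
WEIGHTS = {"correctness": 0.5, "efficiency": 0.3, "explanations": 0.2}

def overall_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each on a 1-10 scale) into one weighted score."""
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)

def band(score: float) -> str:
    """Map the combined score to the interpretation bands from the description."""
    if score < 5:
        return "weak"
    if score < 7:
        return "adequate"
    if score <= 9:
        return "good"
    return "excellent"

example = {"correctness": 8, "efficiency": 6, "explanations": 7}
s = overall_score(example)  # 0.5*8 + 0.3*6 + 0.2*7 = 7.2
print(s, band(s))           # 7.2 good
```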
AlpacaEval 2.0
Score: this covers prompting an LLM to produce numerical evaluations of solutions. Two main approaches are used: (1) direct scoring, where the model outputs a value directly, for example rating an answer in a GPQA task on a scale from 1 to 5; (2) derived scoring, where the model performs intermediate reasoning steps that are then converted into a value, for example binary judgments (0 or 1), computations over several criteria, or defining evaluation criteria and then rating against them. Models can also make choices based on these evaluations, for example selecting among several options. Advantages: it allows comparisons without requiring the model to make final decisions itself, it yields more information about the model, and it can be richer than a simple binary judgment. Disadvantages: some models struggle to produce numerical scores, specific numerical values can be hard to interpret, and it is not always clear which scoring system suits a given task. (Self-reported)
50.5%
Arena Hard
Evaluation: after all tasks are completed, each answer is marked correct or incorrect. Accuracy is then computed for each of the 50 tasks and for each data split; the resulting values can fall above or below the reference values. To obtain a more reliable estimate of accuracy, bootstrapping is used: for each split, 1,000 resamples are drawn with replacement from the sample and accuracy is computed on each resample (a sketch follows this entry). A 95% confidence interval is then taken with the percentile method, using the 2.5th and 97.5th percentiles of the bootstrap distribution. For aggregate results, accuracy is first computed over all tasks for each resample, and the mean and 95% interval of those values are reported. (Self-reported)
76.2%
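The bootstrap procedure described above can be sketched as follows; the resample count of 1,000 and the percentile bounds mirror the description, but the data and implementation are illustrative rather than the exact evaluation code.

```python
# Illustrative bootstrap 95% confidence interval for accuracy, as described above.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(correct: np.ndarray, n_resamples: int = 1000):
    """correct: 0/1 array of per-task outcomes. Returns (mean, lo, hi)."""
    n = len(correct)
    resample_means = np.array([
        correct[rng.integers(0, n, size=n)].mean()  # resample with replacement
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(resample_means, [2.5, 97.5])  # 95% interval
    return correct.mean(), lo, hi

outcomes = (rng.random(50) < 0.762).astype(float)  # hypothetical 50-task run
print(bootstrap_ci(outcomes))
```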
BBH
Score (Self-reported)
84.3%
DS-Arena-Code
Evaluation: accuracy is measured by matching the model's answers against the correct answers, which come in two forms. Multiple-choice answers: the model's answer is compared with the correct option to check whether they match. Free-form answers requiring reasoning: an answer is accepted if it matches the correct answer, even when the solution process differs. For multiple-choice items the score is 1.0 if the model's answer matches the correct answer and 0.0 otherwise; for reasoning items the score is 1.0 if the answer is correct and 0.0 otherwise. The overall score is the average of all item scores. To grade multiple-choice items, the chosen option is extracted from the end of the model's response (for example, "answer: B"); for reasoning items, the final answer is extracted from the end of the response (a sketch of this matching follows the entry). If the model gives multiple answers or no answer can be extracted, the score is 0.0. (Self-reported)
63.1%
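A minimal sketch of the matching scheme described above: extract the final option from the end of the model's output and compare it with the reference, scoring 1.0 on a match and 0.0 otherwise (including when nothing can be extracted). The regex and helper names are illustrative, not the benchmark's actual code.

```python
# Illustrative exact-match scorer for multiple-choice outputs like "answer: B".
import re

def extract_choice(model_output: str) -> str | None:
    """Take the last 'answer: X' pattern in the output, if any."""
    matches = re.findall(r"answer:\s*([A-D])", model_output, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None

def score(model_output: str, reference: str) -> float:
    """1.0 if the extracted choice matches the reference, else 0.0."""
    choice = extract_choice(model_output)
    return 1.0 if choice == reference.upper() else 0.0

print(score("The area doubles, so the answer: b", "B"))  # 1.0
print(score("I am not sure.", "B"))                      # 0.0 (nothing extracted)
```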
DS-FIM-Eval
Score: a scoring system that accounts for the answer itself, the degree of prompting needed, and the interaction with the model while solving the problem. For each problem, we score the level of help the model needed to reach the correct answer (a small sketch follows this entry): 1.0 if the model immediately gives the correct answer without any hints or follow-up questions; 0.5 if the model reaches the correct answer only with help, such as hints about errors or requests to revise the solution, including cases where the model corrects itself in its answer; 0.0 if the model cannot reach the correct answer even with help. This rewards the ability to produce correct answers with minimal assistance, which matters for settings where the model cannot be coached toward the right answer. (Self-reported)
78.3%
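A tiny sketch of the help-level rubric above; the labels and averaging are illustrative.

```python
# Illustrative mapping from "how much help was needed" to the 1.0 / 0.5 / 0.0 rubric.
from enum import Enum

class Help(Enum):
    NONE = "correct without hints"
    HINTED = "correct only after hints or corrections"
    FAILED = "incorrect even with help"

SCORES = {Help.NONE: 1.0, Help.HINTED: 0.5, Help.FAILED: 0.0}

runs = [Help.NONE, Help.HINTED, Help.NONE, Help.FAILED]
print(sum(SCORES[r] for r in runs) / len(runs))  # 0.625
```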
HumanEval-Mul
Pass@1: measures the probability of solving a task on the first attempt. To estimate it, the model generates n sample answers for each task; each sample is judged correct or incorrect, and the probability that one randomly chosen sample out of the n is correct is estimated as c/n, where c is the number of correct samples. Pass@1 over the full task set is the average of these per-task values (an estimator sketch follows this entry). The metric is often used to evaluate models on programming and mathematical tasks; it shows how often the model produces a correct answer on the first attempt, with no opportunity to revise. (Self-reported)
73.8%
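The c/n estimate described above is the k = 1 case of the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k). A short sketch under that assumption; the sample counts are illustrative.

```python
# Sketch of the pass@k estimator; pass@1 reduces to c/n as described above.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled answers is correct),
    given c correct answers among n generated samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-task results: (samples generated, samples correct)
tasks = [(10, 7), (10, 4), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
print(pass_at_1)  # (0.7 + 0.4 + 1.0) / 3 = 0.70
```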
LiveCodeBench(01-09)
Score: a method for analyzing AI models' mathematical abilities through reasoning about possible errors in their own solutions. A GPT-4-based checker, referred to as "Score", performs this analysis without human involvement: it verifies its own solutions to mathematical tasks, finds logical errors, considers alternative approaches, and checks that the answer addresses everything required. On a mathematical task dataset, Score reaches 83.0% accuracy, significantly above GPT-4 (58.0%) and another baseline (75.0%). The procedure: (1) GPT-4 first solves the task; (2) Score reviews this solution; (3) Score identifies possible problems and fixes; (4) Score verifies the solution step by step and checks the final answer. Score stands out for its ability to find and correct errors that arise from typical mathematical LLM mistakes, and it clearly outperforms approaches that sample several solutions while remaining efficient. This suggests that an LLM can substantially improve its mathematical ability when given the means to reason about errors in its own solutions. (Self-reported)
41.8%
MT-Bench
Score (Self-reported)
90.2%

License & Metadata

License
deepseek
Announcement Date
May 8, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.