
DeepSeek VL2

Multimodal
DeepSeek

An advanced series of large multimodal Mixture-of-Experts (MoE) vision-language models that significantly surpasses its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across a range of tasks, including visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.

Key Specifications

Parameters
27.0B
Context
129.3K
Release Date
December 13, 2024
Average Score
70.9%

Timeline

Key dates in the model's history
Announcement
December 13, 2024
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
27.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$9.50
Output (per 1M tokens)
$4800.00
Max Input Tokens
129.3K
Max Output Tokens
129.3K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
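Given the per-1M-token rates listed above, the cost of a single request can be estimated by scaling each rate by the token counts involved. The sketch below is illustrative only; the token counts in the example are hypothetical, and only the two rates are taken from the table.

```python
# Illustrative cost estimate from the listed per-1M-token rates.
# The rates below come from the pricing table; the token counts in the
# example call are hypothetical.
INPUT_RATE = 9.50      # USD per 1M input tokens
OUTPUT_RATE = 4800.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a single request."""
    return (input_tokens / 1_000_000) * INPUT_RATE \
         + (output_tokens / 1_000_000) * OUTPUT_RATE

# Example: 10,000 input tokens and 1,000 output tokens.
print(round(request_cost(10_000, 1_000), 3))  # → 4.895
```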

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
test (Self-reported)
81.4%
ChartQA
test (Self-reported)
86.0%
DocVQA
test (Self-reported)
93.3%
MathVista
testmini (Self-reported)
62.8%
MMMU
val (Self-reported)
51.1%

Other Tests

Specialized benchmarks
InfoVQA
test (Self-reported)
78.1%
MMBench
test (Self-reported)
79.6%
MMBench-V1.1
Self-reported
79.2%
MME
Self-reported
22.5%
MMStar
Self-reported
61.3%
MMT-Bench
Self-reported
63.6%
OCRBench
Self-reported
81.1%
RealWorldQA
Self-reported
68.4%
TextVQA
Self-reported
84.2%
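The "Average Score" of 70.9% reported in Key Specifications matches the unweighted mean of the 14 self-reported benchmark results listed above. A quick check, using only the scores from this page:

```python
# Recompute the page's "Average Score" as the unweighted mean of the
# 14 self-reported benchmark results listed above.
scores = {
    "AI2D": 81.4, "ChartQA": 86.0, "DocVQA": 93.3, "MathVista": 62.8,
    "MMMU": 51.1, "InfoVQA": 78.1, "MMBench": 79.6, "MMBench-V1.1": 79.2,
    "MME": 22.5, "MMStar": 61.3, "MMT-Bench": 63.6, "OCRBench": 81.1,
    "RealWorldQA": 68.4, "TextVQA": 84.2,
}
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # → 70.9
```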

License & Metadata

License
deepseek
Announcement Date
December 13, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.