
DeepSeek VL2

Multimodal
DeepSeek

An advanced series of large multimodal Mixture-of-Experts (MoE) vision-language models that significantly surpasses its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across a range of tasks, including visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.

Key Specifications

Parameters
27.0B
Context
129.3K
Release Date
December 13, 2024
Average Score
70.9%

Timeline

Key dates in the model's history
Announcement
December 13, 2024
Last Update
July 19, 2025
Today
March 25, 2026

Technical Specifications

Parameters
27.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$9.50
Output (per 1M tokens)
$4800.00
Max Input Tokens
129.3K
Max Output Tokens
129.3K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
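Given the per-1M-token rates listed above, the cost of a single request can be estimated by scaling each rate by the token counts involved. The sketch below is illustrative only; the token counts in the example are hypothetical, and only the two rates are taken from the table.

```python
# Illustrative cost estimate from the listed per-1M-token rates.
# The rates below come from the pricing table; the token counts in the
# example call are hypothetical.
INPUT_RATE = 9.50      # USD per 1M input tokens
OUTPUT_RATE = 4800.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a single request."""
    return (input_tokens / 1_000_000) * INPUT_RATE \
         + (output_tokens / 1_000_000) * OUTPUT_RATE

# Example: 10,000 input tokens and 1,000 output tokens.
print(round(request_cost(10_000, 1_000), 3))  # → 4.895
```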

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
test (Self-reported)
81.4%
ChartQA
test (Self-reported)
86.0%
DocVQA
test (Self-reported)
93.3%
MathVista
testmini (Self-reported)
62.8%
MMMU
val (Self-reported)
51.1%

Other Tests

Specialized benchmarks
InfoVQA
test (Self-reported)
78.1%
MMBench
test (Self-reported)
79.6%
MMBench-V1.1
Self-reported
79.2%
MME
Self-reported
22.5%
MMStar
Self-reported
61.3%
MMT-Bench
Self-reported
63.6%
OCRBench
Self-reported
81.1%
RealWorldQA
Self-reported
68.4%
TextVQA
Self-reported
84.2%
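The "Average Score" of 70.9% reported in Key Specifications matches the unweighted mean of the 14 self-reported benchmark results listed above. A quick check, using only the scores from this page:

```python
# Recompute the page's "Average Score" as the unweighted mean of the
# 14 self-reported benchmark results listed above.
scores = {
    "AI2D": 81.4, "ChartQA": 86.0, "DocVQA": 93.3, "MathVista": 62.8,
    "MMMU": 51.1, "InfoVQA": 78.1, "MMBench": 79.6, "MMBench-V1.1": 79.2,
    "MME": 22.5, "MMStar": 61.3, "MMT-Bench": 63.6, "OCRBench": 81.1,
    "RealWorldQA": 68.4, "TextVQA": 84.2,
}
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # → 70.9
```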

License & Metadata

License
deepseek
Announcement Date
December 13, 2024
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter count, and benchmark performance. Choose a model to compare, or go to the full catalog to browse all available AI models.