Qwen2.5 VL 72B Instruct
Multimodal
Qwen2.5-VL is Qwen's new flagship vision-language model, significantly improved over Qwen2-VL. It excels at recognizing objects; analyzing text, charts, and image layouts; acting as a visual agent; understanding long videos (over one hour) with precise event detection; performing visual grounding (bounding boxes and points); and generating structured outputs from documents.
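Below is a minimal usage sketch (not part of the official model card) showing how the model could be queried for visual grounding with a structured JSON output through Hugging Face Transformers. The class and helper names (Qwen2_5_VLForConditionalGeneration, process_vision_info from qwen-vl-utils) follow the public Qwen2.5-VL integration; the image path, prompt, and generation settings are illustrative placeholders.

```python
# Minimal sketch: visual grounding with a structured JSON answer.
# Assumes a recent transformers release with Qwen2.5-VL support and the qwen-vl-utils package.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-72B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Ask for bounding boxes as JSON; the image path is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/invoice.png"},
            {
                "type": "text",
                "text": "Detect every table in this document and return the result as "
                        "JSON: [{\"bbox_2d\": [x1, y1, x2, y2], \"label\": \"table\"}].",
            },
        ],
    }
]

# Build the chat prompt, collect image/video tensors, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the model's answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```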
Key Specifications
Parameters
72.0B
Context
-
Release Date
January 26, 2025
Average Score
66.9%
Timeline
Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
72.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal · ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Working with images and visual data
AI2D
Score • Self-reported
ChartQA
Score • Self-reported
DocVQA
Score • Self-reported
MMMU
Score • Self-reported
Other Tests
Specialized benchmarks
AITZ_EM
EM • Self-reported
Android Control High_EM
EM • Self-reported
Android Control Low_EM
EM • Self-reported
AndroidWorld_SR
SR • Self-reported
CC-OCR
Score • Self-reported
EgoSchema
Score • Self-reported
Hallusion Bench
Score • Self-reported
LVBench
Score • Self-reported
MathVision
Score • Self-reported
MathVista-Mini
Score • Self-reported
MLVU-M
Score • Self-reported
MMBench
Score • Self-reported
MMBench-Video
Score • Self-reported
MMMU-Pro
Score • Self-reported
MMStar
Score • Self-reported
MMVet
Score • Self-reported
MobileMiniWob++_SR
SR • Self-reported
MVBench
Score • Self-reported
OCRBench
Score • Self-reported
OCRBench-V2 (en)
Score • Self-reported
OSWorld
Score • Self-reported
PerceptionTest
Score • Self-reported
ScreenSpot
Score • Self-reported
ScreenSpot Pro
Score • Self-reported
TempCompass
Score • Self-reported
VideoMME w/o sub.
Score • Self-reported
License & Metadata
License
tongyi_qianwen
Announcement Date
January 26, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2-VL-72B-Instruct
Alibaba
MM · 73.4B
Released: Aug 2024
Qwen3 VL 32B Thinking
Alibaba
MM · 33.0B
Released: Sep 2025
Qwen2.5 VL 32B Instruct
Alibaba
MM · 33.5B
Best score: 0.9 (HumanEval)
Released: Feb 2025
QvQ-72B-Preview
Alibaba
MM · 73.4B
Released: Dec 2024
Qwen3.5-397B-A17B
Alibaba
MM · 397.0B
Released: Feb 2026
Qwen2.5 VL 7B Instruct
Alibaba
MM · 8.3B
Released: Jan 2025
Qwen2.5-Omni-7B
Alibaba
MM · 7.0B
Best score: 0.8 (HumanEval)
Released: Mar 2025
Qwen3-Next-80B-A3B-Instruct
Alibaba
80.0B
Released: Sep 2025
Price: $0.15/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
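As a rough illustration of such a characteristic-based recommendation, the sketch below computes a weighted similarity between two model cards. The fields, weights, and the comparison model's average score are illustrative assumptions, not the catalog's actual algorithm.

```python
# Hypothetical sketch of a characteristic-based similarity score between two model cards.
# Field names, weights, and the second model's average score are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ModelCard:
    developer: str
    multimodal: bool
    params_b: float        # parameter count in billions
    avg_benchmark: float   # average benchmark score in [0, 1]

def similarity(a: ModelCard, b: ModelCard) -> float:
    """Return a similarity score in [0, 1] from weighted per-characteristic matches."""
    same_dev = 1.0 if a.developer == b.developer else 0.0
    same_modality = 1.0 if a.multimodal == b.multimodal else 0.0
    # Relative closeness of parameter counts and of average benchmark scores.
    size_sim = 1.0 - abs(a.params_b - b.params_b) / max(a.params_b, b.params_b)
    bench_sim = 1.0 - abs(a.avg_benchmark - b.avg_benchmark)
    # Illustrative weights summing to 1.0.
    return 0.3 * same_dev + 0.2 * same_modality + 0.3 * size_sim + 0.2 * bench_sim

qwen25_vl_72b = ModelCard("Alibaba", True, 72.0, 0.669)   # average score from this page
qwen2_vl_72b = ModelCard("Alibaba", True, 73.4, 0.65)     # placeholder average score
print(f"similarity = {similarity(qwen25_vl_72b, qwen2_vl_72b):.2f}")
```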