
Qwen2.5 VL 72B Instruct

Multimodal
Alibaba

Qwen2.5-VL is Qwen's new flagship multimodal language model, significantly improved over Qwen2-VL. It excels at recognizing objects, analyzing text, charts, and image layouts, acts as a visual agent, understands long videos (over 1 hour) with precise event detection, performs visual grounding (bounding boxes and points), and generates structured outputs from documents.

Key Specifications

Parameters
72.0B
Context
-
Release Date
January 26, 2025
Average Score
66.9%

Timeline

Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
72.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
Score (Self-reported)
88.4%
ChartQA
Score (Self-reported)
89.5%
DocVQA
Score (Self-reported)
96.4%
MMMU
Score (Self-reported)
70.2%

Other Tests

Specialized benchmarks
AITZ_EM
EM (Self-reported)
83.2%
Android Control High_EM
EM (Self-reported)
67.4%
Android Control Low_EM
EM (Self-reported)
93.7%
AndroidWorld_SR
SR (Self-reported)
35.0%
CC-OCR
Score (Self-reported)
79.8%
EgoSchema
Score (Self-reported)
76.2%
Hallusion Bench
Score (Self-reported)
55.2%
LVBench
Score (Self-reported)
47.3%
MathVision
Score (Self-reported)
38.1%
MathVista-Mini
Score (Self-reported)
74.8%
MLVU-M
Score (Self-reported)
74.6%
MMBench
Score (Self-reported)
88.0%
MMBench-Video
Score (Self-reported)
2.0%
MMMU-Pro
Score (Self-reported)
51.1%
MMStar
Score (Self-reported)
70.8%
MMVet
Score (Self-reported)
76.2%
MobileMiniWob++_SR
SR (Self-reported)
68.0%
MVBench
Score (Self-reported)
70.4%
OCRBench
Score (Self-reported)
88.5%
OCRBench-V2 (en)
Score (Self-reported)
61.5%
OSWorld
Score (Self-reported)
8.8%
PerceptionTest
Score (Self-reported)
73.2%
ScreenSpot
Score (Self-reported)
87.1%
ScreenSpot Pro
Score (Self-reported)
43.6%
TempCompass
Score (Self-reported)
74.8%
VideoMME w/o sub.
Score (Self-reported)
73.3%
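The 66.9% Average Score shown in Key Specifications matches the unweighted arithmetic mean of the 30 self-reported benchmark scores listed above. A minimal sketch of that calculation (assuming a simple mean, which may not be the catalog's exact aggregation method):

```python
# Self-reported benchmark scores from this page, in the order listed (30 total).
scores = [
    88.4, 89.5, 96.4, 70.2,              # AI2D, ChartQA, DocVQA, MMMU
    83.2, 67.4, 93.7, 35.0, 79.8,        # AITZ_EM, Android Control High/Low, AndroidWorld_SR, CC-OCR
    76.2, 55.2, 47.3, 38.1, 74.8,        # EgoSchema, HallusionBench, LVBench, MathVision, MathVista-Mini
    74.6, 88.0, 2.0, 51.1, 70.8,         # MLVU-M, MMBench, MMBench-Video, MMMU-Pro, MMStar
    76.2, 68.0, 70.4, 88.5, 61.5,        # MMVet, MobileMiniWob++_SR, MVBench, OCRBench, OCRBench-V2
    8.8, 73.2, 87.1, 43.6, 74.8, 73.3,   # OSWorld, PerceptionTest, ScreenSpot, ScreenSpot Pro, TempCompass, VideoMME
]

average = sum(scores) / len(scores)
print(f"{average:.1f}%")  # → 66.9%
```

Note that a plain mean mixes metrics on different scales (EM, SR, accuracy), so the single number is best read as a rough cross-benchmark summary rather than a calibrated capability score.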

License & Metadata

License
tongyi_qianwen
Announcement Date
January 26, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.