
Qwen2.5 VL 72B Instruct

Multimodal
Alibaba

Qwen2.5-VL is Qwen's new flagship multimodal language model, significantly improved over Qwen2-VL. It excels at recognizing objects, analyzing text, charts, and image layouts, acts as a visual agent, understands long videos (over 1 hour) with precise event detection, performs visual grounding (bounding boxes and points), and generates structured outputs from documents.

Key Specifications

Parameters
72.0B
Context
-
Release Date
January 26, 2025
Average Score
66.9%

Timeline

Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
72.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
AI2D
Score (Self-reported)
88.4%
ChartQA
Score (Self-reported)
89.5%
DocVQA
Score (Self-reported)
96.4%
MMMU
Score (Self-reported)
70.2%

Other Tests

Specialized benchmarks
AITZ_EM
EM (Self-reported)
83.2%
Android Control High_EM
EM (Self-reported)
67.4%
Android Control Low_EM
EM (Self-reported)
93.7%
AndroidWorld_SR
SR (Self-reported)
35.0%
CC-OCR
Score (Self-reported)
79.8%
EgoSchema
Score (Self-reported)
76.2%
Hallusion Bench
Score (Self-reported)
55.2%
LVBench
Score (Self-reported)
47.3%
MathVision
Score (Self-reported)
38.1%
MathVista-Mini
Score (Self-reported)
74.8%
MLVU-M
Score (Self-reported)
74.6%
MMBench
Score (Self-reported)
88.0%
MMBench-Video
Score (Self-reported)
2.0%
MMMU-Pro
Score (Self-reported)
51.1%
MMStar
Score (Self-reported)
70.8%
MMVet
Score (Self-reported)
76.2%
MobileMiniWob++_SR
SR (Self-reported)
68.0%
MVBench
Score (Self-reported)
70.4%
OCRBench
Score (Self-reported)
88.5%
OCRBench-V2 (en)
Score (Self-reported)
61.5%
OSWorld
Score (Self-reported)
8.8%
PerceptionTest
Score (Self-reported)
73.2%
ScreenSpot
Score (Self-reported)
87.1%
ScreenSpot Pro
Score (Self-reported)
43.6%
TempCompass
Score (Self-reported)
74.8%
VideoMME w/o sub.
Score (Self-reported)
73.3%
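The 66.9% Average Score shown in Key Specifications matches the unweighted arithmetic mean of the 30 self-reported benchmark scores listed above. A minimal sketch of that calculation (assuming a simple mean, which may not be the catalog's exact aggregation method):

```python
# Self-reported benchmark scores from this page, in the order listed (30 total).
scores = [
    88.4, 89.5, 96.4, 70.2,              # AI2D, ChartQA, DocVQA, MMMU
    83.2, 67.4, 93.7, 35.0, 79.8,        # AITZ_EM, Android Control High/Low, AndroidWorld_SR, CC-OCR
    76.2, 55.2, 47.3, 38.1, 74.8,        # EgoSchema, HallusionBench, LVBench, MathVision, MathVista-Mini
    74.6, 88.0, 2.0, 51.1, 70.8,         # MLVU-M, MMBench, MMBench-Video, MMMU-Pro, MMStar
    76.2, 68.0, 70.4, 88.5, 61.5,        # MMVet, MobileMiniWob++_SR, MVBench, OCRBench, OCRBench-V2
    8.8, 73.2, 87.1, 43.6, 74.8, 73.3,   # OSWorld, PerceptionTest, ScreenSpot, ScreenSpot Pro, TempCompass, VideoMME
]

average = sum(scores) / len(scores)
print(f"{average:.1f}%")  # → 66.9%
```

Note that a plain mean mixes metrics on different scales (EM, SR, accuracy), so the single number is best read as a rough cross-benchmark summary rather than a calibrated capability score.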

License & Metadata

License
tongyi_qianwen
Announcement Date
January 26, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.