
Qwen2.5 VL 32B Instruct

Multimodal
Alibaba

Qwen2.5-VL is a multimodal language model from the Qwen family. Key improvements include visual information understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video understanding with event detection, visual grounding (bounding boxes/points), and structured output generation.
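Since the model accepts interleaved image and text input, a minimal sketch of how it is typically driven through the Hugging Face transformers chat-message schema may help; the image URL and prompt below are placeholders, and the commented-out calls assume a recent transformers release with Qwen2.5-VL support:

```python
# Chat-style request in the message schema used by the Qwen2.5-VL processor.
# The image URL and the prompt are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.png"},
            {"type": "text", "text": "Read the total amount and return it as JSON."},
        ],
    }
]

# With transformers installed, the model would be driven roughly like this:
#   from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto")
#   processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")
#   prompt = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True)
```

The same message list works for structured-output prompts (e.g. asking for bounding boxes as JSON), which is one of the capabilities highlighted above.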

Key Specifications

Parameters
33.5B
Context
-
Release Date
February 28, 2025
Average Score
63.6%
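The headline average appears to be the unweighted mean of the 28 self-reported benchmark scores listed further down this page; a minimal sketch (score list copied from the Benchmark Results section):

```python
# Self-reported benchmark scores from this page, in listing order
# (MMLU, HumanEval, MBPP, MATH, GPQA, DocVQA, MMMU, then the "Other Tests").
scores = [78.4, 91.5, 84.0, 82.2, 46.0, 94.8, 70.0,
          83.1, 69.6, 93.3, 22.0, 77.1, 54.2, 83.4,
          49.0, 38.4, 74.7, 1.9, 68.8, 49.5, 69.5,
          57.2, 59.1, 5.9, 88.5, 39.4, 70.5, 77.9]

average = round(sum(scores) / len(scores), 1)
print(average)  # 63.6
```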

Timeline

Key dates in the model's history
Announcement
February 28, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
33.5B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Score (Self-reported)
78.4%

Programming

Programming skills tests
HumanEval
Self-reported
91.5%
MBPP
Self-reported
84.0%

Mathematics

Mathematical problems and computations
MATH
Score (Self-reported)
82.2%

Reasoning

Logical reasoning and analysis
GPQA
Score (Self-reported)
46.0%

Multimodal

Working with images and visual data
DocVQA
Self-reported
94.8%
MMMU
Score (Self-reported)
70.0%

Other Tests

Specialized benchmarks
AITZ_EM
EM (Self-reported)
83.1%
Android Control High_EM
EM (Self-reported)
69.6%
Android Control Low_EM
Standard evaluation (Self-reported)
93.3%
AndroidWorld_SR
SR (Self-reported)
22.0%
CC-OCR
Score (Self-reported)
77.1%
CharadesSTA
Self-reported
54.2%
InfoVQA
Score (Self-reported)
83.4%
LVBench
Score (Self-reported)
49.0%
MathVision
Self-reported
38.4%
MathVista-Mini
Score (Self-reported)
74.7%
MMBench-Video
Self-reported
1.9%
MMLU-Pro
Score (Self-reported)
68.8%
MMMU-Pro
Score (Self-reported)
49.5%
MMStar
Score (Self-reported)
69.5%
OCRBench-V2 (en)
Self-reported
57.2%
OCRBench-V2 (zh)
Score (Self-reported)
59.1%
OSWorld
Score (Self-reported)
5.9%
ScreenSpot
Score (Self-reported)
88.5%
ScreenSpot Pro
Self-reported
39.4%
VideoMME w/o sub.
Self-reported
70.5%
VideoMME w sub.
Self-reported
77.9%

License & Metadata

License
apache_2_0
Announcement Date
February 28, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.