Qwen2.5 VL 32B Instruct
Multimodal
Qwen2.5-VL is a multimodal language model from the Qwen family. Key improvements include visual information understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video understanding with event detection, visual grounding (bounding boxes/points), and structured output generation.
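The page itself does not include usage code. As a hedged illustration of how such a checkpoint is typically queried (the checkpoint ID "Qwen/Qwen2.5-VL-32B-Instruct", the transformers classes, and the sample image file are assumptions, not details from this page), a minimal sketch:

```python
# Minimal usage sketch (illustrative only): querying Qwen2.5-VL-32B-Instruct through the
# Hugging Face transformers integration. Checkpoint ID, class names, and the sample image
# are assumptions, not details taken from this catalog page.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map="auto" requires accelerate
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus a text instruction; the chat template inserts the vision placeholder tokens.
image = Image.open("chart.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this chart and return the key figures as JSON."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```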
Key Specifications
Parameters
33.5B
Context
-
Release Date
February 28, 2025
Average Score
63.6%
Timeline
Key dates in the model's history
Announcement
February 28, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
33.5B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal · ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Score • Self-reported
Programming
Programming skills tests
HumanEval
Score • Self-reported
MBPP
Score • Self-reported
Mathematics
Mathematical problems and computations
MATH
Score • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Score • Self-reported
Multimodal
Working with images and visual data
DocVQA
Score • Self-reported
MMMU
Score • Self-reported
Other Tests
Specialized benchmarks
AITZ_EM
EM • Self-reported
Android Control High_EM
EM • Self-reported
Android Control Low_EM
EM • Self-reported
AndroidWorld_SR
SR • Self-reported
CC-OCR
Score • Self-reported
CharadesSTA
Score • Self-reported
InfoVQA
Score • Self-reported
LVBench
Score • Self-reported
MathVision
Score • Self-reported
MathVista-Mini
Score • Self-reported
MMBench-Video
Score • Self-reported
MMLU-Pro
Score • Self-reported
MMMU-Pro
Score • Self-reported
MMStar
Score • Self-reported
OCRBench-V2 (en)
Score • Self-reported
OCRBench-V2 (zh)
Score • Self-reported
OSWorld
Score • Self-reported
ScreenSpot
Score • Self-reported
ScreenSpot Pro
Score • Self-reported
VideoMME w/o sub.
Score • Self-reported
VideoMME w sub.
Score • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
February 28, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2-VL-72B-Instruct
Alibaba
MM · 73.4B
Released: Aug 2024
Qwen2.5 VL 72B Instruct
Alibaba
MM · 72.0B
Released: Jan 2025
Qwen3 VL 32B Thinking
Alibaba
MM · 33.0B
Released: Sep 2025
QvQ-72B-Preview
Alibaba
MM · 73.4B
Released: Dec 2024
Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.
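As a toy sketch of what such a similarity heuristic could look like (the weights, feature names, and the comparison record below are hypothetical illustrations, not the catalog's actual method or data):

```python
# Hypothetical similarity heuristic over the characteristics the note mentions:
# developer organization, multimodality, parameter size, and benchmark average.
# Weights and the "candidate" record are illustrative placeholders, not catalog data.
def similarity(a: dict, b: dict) -> float:
    score = 0.0
    score += 0.35 if a["developer"] == b["developer"] else 0.0
    score += 0.25 if a["multimodal"] == b["multimodal"] else 0.0
    # Closer parameter counts score higher (ratio of smaller to larger count).
    score += 0.25 * (min(a["params_b"], b["params_b"]) / max(a["params_b"], b["params_b"]))
    # Closer benchmark averages score higher (absolute difference on a 0-100 scale).
    score += 0.15 * (1.0 - abs(a["avg_score"] - b["avg_score"]) / 100.0)
    return score

this_model = {"developer": "Alibaba", "multimodal": True, "params_b": 33.5, "avg_score": 63.6}
candidate = {"developer": "Alibaba", "multimodal": True, "params_b": 72.0, "avg_score": 70.0}  # placeholder values
print(f"{similarity(this_model, candidate):.3f}")
```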