
Qwen2.5 VL 32B Instruct

Multimodal
Alibaba

Qwen2.5-VL is a multimodal language model from the Qwen family. Key improvements include visual information understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long video understanding with event detection, visual grounding (bounding boxes/points), and structured output generation.
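Since the model accepts interleaved image and text input, a minimal sketch of how it is typically driven through the Hugging Face transformers chat-message schema may help; the image URL and prompt below are placeholders, and the commented-out calls assume a recent transformers release with Qwen2.5-VL support:

```python
# Chat-style request in the message schema used by the Qwen2.5-VL processor.
# The image URL and the prompt are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.png"},
            {"type": "text", "text": "Read the total amount and return it as JSON."},
        ],
    }
]

# With transformers installed, the model would be driven roughly like this:
#   from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       "Qwen/Qwen2.5-VL-32B-Instruct", torch_dtype="auto", device_map="auto")
#   processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-32B-Instruct")
#   prompt = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True)
```

The same message list works for structured-output prompts (e.g. asking for bounding boxes as JSON), which is one of the capabilities highlighted above.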

Key Specifications

Parameters
33.5B
Context
-
Release Date
February 28, 2025
Average Score
63.6%
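The headline average appears to be the unweighted mean of the 28 self-reported benchmark scores listed further down this page; a minimal sketch (score list copied from the Benchmark Results section):

```python
# Self-reported benchmark scores from this page, in listing order
# (MMLU, HumanEval, MBPP, MATH, GPQA, DocVQA, MMMU, then the "Other Tests").
scores = [78.4, 91.5, 84.0, 82.2, 46.0, 94.8, 70.0,
          83.1, 69.6, 93.3, 22.0, 77.1, 54.2, 83.4,
          49.0, 38.4, 74.7, 1.9, 68.8, 49.5, 69.5,
          57.2, 59.1, 5.9, 88.5, 39.4, 70.5, 77.9]

average = round(sum(scores) / len(scores), 1)
print(average)  # 63.6
```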

Timeline

Key dates in the model's history
Announcement
February 28, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
33.5B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Score (Self-reported)
78.4%

Programming

Programming skills tests
HumanEval
Self-reported
91.5%
MBPP
Self-reported
84.0%

Mathematics

Mathematical problems and computations
MATH
Score (Self-reported)
82.2%

Reasoning

Logical reasoning and analysis
GPQA
Score (Self-reported)
46.0%

Multimodal

Working with images and visual data
DocVQA
Self-reported
94.8%
MMMU
Score (Self-reported)
70.0%

Other Tests

Specialized benchmarks
AITZ_EM
EM (Self-reported)
83.1%
Android Control High_EM
EM (Self-reported)
69.6%
Android Control Low_EM
Standard evaluation (Self-reported)
93.3%
AndroidWorld_SR
SR (Self-reported)
22.0%
CC-OCR
Score (Self-reported)
77.1%
CharadesSTA
Self-reported
54.2%
InfoVQA
Score (Self-reported)
83.4%
LVBench
Score (Self-reported)
49.0%
MathVision
Self-reported
38.4%
MathVista-Mini
Score (Self-reported)
74.7%
MMBench-Video
Self-reported
1.9%
MMLU-Pro
Score (Self-reported)
68.8%
MMMU-Pro
Score (Self-reported)
49.5%
MMStar
Score (Self-reported)
69.5%
OCRBench-V2 (en)
Self-reported
57.2%
OCRBench-V2 (zh)
Score (Self-reported)
59.1%
OSWorld
Score (Self-reported)
5.9%
ScreenSpot
Score (Self-reported)
88.5%
ScreenSpot Pro
Self-reported
39.4%
VideoMME w/o sub.
Self-reported
70.5%
VideoMME w sub.
Self-reported
77.9%

License & Metadata

License
apache_2_0
Announcement Date
February 28, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.