Qwen2.5 VL 7B Instruct
Multimodal
Qwen2.5-VL is a multimodal vision-language model from the Qwen family. Key improvements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long-video understanding with event detection, visual grounding (bounding boxes and points), and structured output generation.
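As a concrete illustration of the multimodal chat workflow described above, the sketch below shows single-image question answering with the Hugging Face transformers integration. This is a minimal sketch, not taken from this page: the repo id Qwen/Qwen2.5-VL-7B-Instruct, the Qwen2_5_VLForConditionalGeneration class, and the qwen_vl_utils helper reflect the upstream release and may differ across library versions; the image URL is a placeholder.

```python
# Minimal single-image inference sketch (assumes a recent transformers release with
# Qwen2.5-VL support and the optional qwen-vl-utils helper package installed).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the weights and the paired processor (tokenizer + image preprocessor).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus one question, in the chat format the processor expects.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "Describe this chart and report the highest value."},
        ],
    }
]

# Build the prompt text, extract the vision inputs, and tokenize everything together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same chat format extends to multi-image and video inputs; only the `content` list changes.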
Key Specifications
Parameters
8.3B
Context
-
Release Date
January 26, 2025
Average Score
64.5%
Timeline
Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
8.3B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Working with images and visual data
ChartQA
Score • Self-reported
DocVQA
Score • Self-reported
MMMU
Score • Self-reported
Other Tests
Specialized benchmarks
AITZ_EM
EM • Self-reported
Android Control High_EM
EM • Self-reported
Android Control Low_EM
EM • Self-reported
AndroidWorld_SR
SR • Self-reported
CC-OCR
Score • Self-reported
CharadesSTA
mIoU • Self-reported
Hallusion Bench
Score • Self-reported
InfoVQA
Score • Self-reported
LongVideoBench
Score • Self-reported
LVBench
Score • Self-reported
MathVision
Score. Each task is graded on a three-point rubric: 1.0 for a fully correct solution, 0.5 for a partially correct solution containing some flawed steps, and 0.0 for major mathematical errors or a failure to solve the task. The model also receives 0 when it gives no answer or explicitly states that it cannot solve the task. • Self-reported
MathVista-Mini
Score • Self-reported
MLVU
Score • Self-reported
MMBench
Score • Self-reported
MMBench-Video
Score • Self-reported
MMMU-Pro
Score • Self-reported
MMStar
Score • Self-reported
MMT-Bench
Score • Self-reported
MMVet
Score • Self-reported
MobileMiniWob++_SR
SR • Self-reported
MVBench
Score • Self-reported
OCRBench
Score • Self-reported
PerceptionTest
Score • Self-reported
ScreenSpot
Score • Self-reported
ScreenSpot Pro
Score • Self-reported
TempCompass
Score • Self-reported
TextVQA
Score • Self-reported
VideoMME w/o sub.
Score • Self-reported
VideoMME w sub.
Score • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
January 26, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5-Omni-7B
Alibaba
MM7.0B
Best score:0.8 (HumanEval)
Released:Mar 2025
Qwen2-VL-72B-Instruct
Alibaba
MM73.4B
Released:Aug 2024
Qwen2.5 VL 72B Instruct
Alibaba
MM72.0B
Released:Jan 2025
Qwen3 VL 32B Thinking
Alibaba
MM33.0B
Released:Sep 2025
Qwen3.5-397B-A17B
Alibaba
MM397.0B
Released:Feb 2026
Qwen2.5 VL 32B Instruct
Alibaba
MM33.5B
Best score:0.9 (HumanEval)
Released:Feb 2025
QvQ-72B-Preview
Alibaba
MM73.4B
Released:Dec 2024
Qwen3.5 9B
Alibaba
9.0B
Released:Mar 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.