Qwen2.5 VL 7B Instruct
Multimodal
Qwen2.5-VL is a multimodal vision-language model from the Qwen family. Key improvements include visual understanding (objects, text, charts, layouts), visual agent capabilities (tool use, computer/phone control), long-video understanding with event detection, visual grounding (bounding boxes and points), and structured output generation.
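As a concrete illustration of the multimodal chat workflow described above, the sketch below shows single-image question answering with the Hugging Face transformers integration. This is a minimal sketch, not taken from this page: the repo id Qwen/Qwen2.5-VL-7B-Instruct, the Qwen2_5_VLForConditionalGeneration class, and the qwen_vl_utils helper reflect the upstream release and may differ across library versions; the image URL is a placeholder.

```python
# Minimal single-image inference sketch (assumes a recent transformers release with
# Qwen2.5-VL support and the optional qwen-vl-utils helper package installed).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the weights and the paired processor (tokenizer + image preprocessor).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus one question, in the chat format the processor expects.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "Describe this chart and report the highest value."},
        ],
    }
]

# Build the prompt text, extract the vision inputs, and tokenize everything together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same chat format extends to multi-image and video inputs; only the `content` list changes.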
Key Specifications
Parameters
8.3B
Context
-
Release Date
January 26, 2025
Average Score
64.5%
Timeline
Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
8.3B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Multimodal
Working with images and visual data
ChartQA
Score • Self-reported
DocVQA
Score • Self-reported
MMMU
Score • Self-reported
Other Tests
Specialized benchmarks
AITZ_EM
EM • Self-reported
Android Control High_EM
EM • Self-reported
Android Control Low_EM
EM • Self-reported
AndroidWorld_SR
SR • Self-reported
CC-OCR
Score • Self-reported
CharadesSTA
mIoU • Self-reported
Hallusion Bench
Score • Self-reported
InfoVQA
Score • Self-reported
LongVideoBench
Score • Self-reported
LVBench
Score • Self-reported
MathVision
Score. Each task is graded on a three-point rubric: 1.0 for a fully correct solution, 0.5 for a partially correct solution containing some flawed steps, and 0.0 for major mathematical errors or a failure to solve the task. The model also receives 0 when it gives no answer or explicitly states that it cannot solve the task. • Self-reported
MathVista-Mini
Score • Self-reported
MLVU
Score • Self-reported
MMBench
Score • Self-reported
MMBench-Video
Score • Self-reported
MMMU-Pro
Score • Self-reported
MMStar
Score • Self-reported
MMT-Bench
Score • Self-reported
MMVet
Score • Self-reported
MobileMiniWob++_SR
SR • Self-reported
MVBench
Score • Self-reported
OCRBench
Score • Self-reported
PerceptionTest
Score • Self-reported
ScreenSpot
Score • Self-reported
ScreenSpot Pro
Score • Self-reported
TempCompass
Score • Self-reported
TextVQA
Score • Self-reported
VideoMME w/o sub.
Score • Self-reported
VideoMME w sub.
Score • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
January 26, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5-Omni-7B
Alibaba
MM7.0B
Best score:0.8 (HumanEval)
Released:Mar 2025
Qwen2-VL-72B-Instruct
Alibaba
MM73.4B
Released:Aug 2024
Qwen2.5 VL 72B Instruct
Alibaba
MM72.0B
Released:Jan 2025
Qwen3 VL 32B Thinking
Alibaba
MM33.0B
Released:Sep 2025
Qwen3.5-397B-A17B
Alibaba
MM397.0B
Released:Feb 2026
Qwen2.5 VL 32B Instruct
Alibaba
MM33.5B
Best score:0.9 (HumanEval)
Released:Feb 2025
QvQ-72B-Preview
Alibaba
MM73.4B
Released:Dec 2024
Qwen3.5 9B
Alibaba
9.0B
Released:Mar 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.