
Qwen2.5 VL 7B Instruct

Multimodal
Alibaba

Qwen2.5-VL is a multimodal language model in the Qwen family. Key improvements over its predecessor include stronger visual understanding (objects, text, charts, and layouts), visual-agent capabilities (tool use and computer/phone control), long-video understanding with event detection, visual grounding (bounding boxes and points), and structured output generation.
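The visual-grounding capability returns detections as structured text rather than free prose. As a minimal sketch of consuming such output (the exact JSON schema, including the `bbox_2d` and `label` keys used here, is an assumption and should be checked against the model's actual responses):

```python
import json

def parse_grounding(response_text):
    """Parse a JSON list of detections from a model response.

    Assumes (hypothetically) that each detection is an object with a
    `label` string and a `bbox_2d` list of [x1, y1, x2, y2] pixel
    coordinates.
    """
    detections = json.loads(response_text)
    boxes = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]
        boxes.append({"label": det["label"], "box": (x1, y1, x2, y2)})
    return boxes

# Example response text a grounding prompt might produce:
example = '[{"label": "dog", "bbox_2d": [10, 20, 110, 220]}]'
print(parse_grounding(example))
```

Because the output is plain JSON, the same parser works regardless of how the model is served (local inference or an API endpoint).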

Key Specifications

Parameters
8.3B
Context
-
Release Date
January 26, 2025
Average Score
64.5%

Timeline

Key dates in the model's history
Announcement
January 26, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
8.3B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

Multimodal

Working with images and visual data
ChartQA
Score (Self-reported)
87.3%
DocVQA
Score (Self-reported)
95.7%
MMMU
Score (Self-reported)
58.6%

Other Tests

Specialized benchmarks
AITZ_EM
EM (Self-reported)
81.9%
Android Control High_EM
EM (Self-reported)
60.1%
Android Control Low_EM
EM (Self-reported)
91.4%
AndroidWorld_SR
SR (Self-reported)
25.5%
CC-OCR
Score (Self-reported)
77.8%
CharadesSTA
mIoU (Self-reported)
43.6%
Hallusion Bench
Score (Self-reported)
52.9%
InfoVQA
Score (Self-reported)
82.6%
LongVideoBench
Score (Self-reported)
54.7%
LVBench
Score (Self-reported)
45.3%
MathVision
Score (Self-reported)
25.1%
MathVista-Mini
Score (Self-reported)
68.2%
MLVU
Score (Self-reported)
70.2%
MMBench
Score (Self-reported)
84.3%
MMBench-Video
Score (Self-reported)
1.8%
MMMU-Pro
Score (Self-reported)
38.3%
MMStar
Score (Self-reported)
63.9%
MMT-Bench
Score (Self-reported)
63.6%
MMVet
Score (Self-reported)
67.1%
MobileMiniWob++_SR
SR (Self-reported)
91.4%
MVBench
Score (Self-reported)
69.6%
OCRBench
Score (Self-reported)
86.4%
PerceptionTest
Score (Self-reported)
70.5%
ScreenSpot
Score (Self-reported)
84.7%
ScreenSpot Pro
Score (Self-reported)
29.0%
TempCompass
Score (Self-reported)
71.7%
TextVQA
Score (Self-reported)
84.9%
VideoMME w/o sub.
Score (Self-reported)
65.1%
VideoMME w sub.
Score (Self-reported)
71.6%
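The per-benchmark percentages above can be condensed into a single headline number. As an illustrative sketch only (an unweighted mean over a hypothetical subset of benchmarks, not necessarily how the 64.5% average on this page was computed):

```python
# Hypothetical subset of self-reported scores from the tables above (percent).
scores = {
    "ChartQA": 87.3,
    "DocVQA": 95.7,
    "MMMU": 58.6,
    "OCRBench": 86.4,
    "TextVQA": 84.9,
}

# Unweighted mean over the chosen subset.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")
```

Which benchmarks are included, and whether they are weighted, changes the aggregate substantially, so headline averages from different catalogs are rarely comparable.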

License & Metadata

License
Apache 2.0
Announcement Date
January 26, 2025
Last Updated
July 19, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
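The similarity-based recommendation described above could, for example, score candidate models by combining matches on categorical attributes with closeness on numeric ones. A minimal sketch (the feature set, weights, and formula are assumptions for illustration, not the site's actual algorithm):

```python
def similarity(a, b):
    """Crude similarity between two model records (hypothetical scheme):
    +1 for a matching developer, +1 for matching modality, plus closeness
    terms (each in 0..1) for parameter count and average benchmark score."""
    score = 0.0
    score += 1.0 if a["developer"] == b["developer"] else 0.0
    score += 1.0 if a["multimodal"] == b["multimodal"] else 0.0
    # Relative closeness in parameter count (billions).
    score += 1.0 - abs(a["params_b"] - b["params_b"]) / max(a["params_b"], b["params_b"])
    # Closeness in average benchmark score (percent scale).
    score += 1.0 - abs(a["avg_score"] - b["avg_score"]) / 100.0
    return score

qwen = {"developer": "Alibaba", "multimodal": True, "params_b": 8.3, "avg_score": 64.5}
other = {"developer": "Alibaba", "multimodal": True, "params_b": 7.6, "avg_score": 60.0}
print(round(similarity(qwen, other), 3))
```

Ranking all catalog entries by this score and taking the top few would yield a "Similar Models" list like the one on this page.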