Qwen2.5-Omni-7B
Multimodal
Qwen2.5-Omni is the flagship end-to-end multimodal model in the Qwen series. It processes diverse inputs including text, images, audio, and video, and provides real-time streaming responses through text generation and natural speech synthesis, built on the novel Thinker-Talker architecture.
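For orientation, here is a minimal inference sketch against the publicly released checkpoint. It assumes the Hugging Face repo `Qwen/Qwen2.5-Omni-7B`, the `Qwen2_5OmniForConditionalGeneration` / `Qwen2_5OmniProcessor` classes shipped in recent `transformers` releases, and the `qwen_omni_utils` helper package referenced in the model card; exact class and helper names can differ between versions, and the video URL is a placeholder, so treat this as a sketch rather than the canonical recipe.

```python
# Minimal sketch: text + speech generation with Qwen2.5-Omni.
# Class and helper names follow recent transformers releases and the Qwen model card;
# verify them against your installed version before relying on this snippet.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper package from the Qwen model card

model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# A multimodal chat turn: the "Thinker" produces text, the "Talker" synthesizes speech.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "https://example.com/clip.mp4"},  # placeholder URL
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# generate() returns token ids plus a synthesized waveform when speech output is enabled.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```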
Key Specifications
Parameters
7.0B
Context
-
Release Date
March 27, 2025
Average Score
59.2%
Timeline
Key dates in the model's history
Announcement
March 27, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
7.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
HumanEval
Evaluation • Self-reported
MBPP
Score
Evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Score
Evaluation • Self-reported
MATH
Score
Evaluation • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Evaluation • Self-reported
Multimodal
Working with images and visual data
AI2D
Evaluation • Self-reported
ChartQA
Score • Self-reported
DocVQA
Score • Self-reported
MathVista
Score • Self-reported
MMMU
Evaluation • Self-reported
Other Tests
Specialized benchmarks
Common Voice 15
WER (Word Error Rate) measures the discrepancy between a recognized transcript and the reference text. It is computed as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the total number of words in the reference; lower is better, and 0% means an exact match. A worked computation sketch follows the benchmark list. • Self-reported
CoVoST2 en-zh
BLEU (Bilingual Evaluation Understudy) is a metric for machine-translation quality. It measures n-gram overlap between a candidate translation and one or more references, combined with a brevity penalty; scores range from 0 to 1 (often scaled to 0-100), with higher values indicating closer agreement. It is the standard metric for comparing translation systems, although it does not capture synonymy or meaning-level equivalence. A minimal computation sketch follows the benchmark list. • Self-reported
CRPE relation
Score • Self-reported
EgoSchema
Score • Self-reported
FLEURS
WER • Self-reported
GiantSteps Tempo
Evaluation • Self-reported
LiveBench
Score • Self-reported
MathVision
Score • Self-reported
MELD
Evaluation • Self-reported
MMAU
Score
Evaluation • Self-reported
MMAU Music
Evaluation • Self-reported
MMAU Sound
Score • Self-reported
MMAU Speech
Score • Self-reported
MMBench-V1.1
Score • Self-reported
MME-RealWorld
Evaluation • Self-reported
MMLU-Pro
Score • Self-reported
MMLU-Redux
Evaluation • Self-reported
MM-MT-Bench
Evaluation • Self-reported
MMMU-Pro
Evaluation • Self-reported
MMStar
Evaluation • Self-reported
MuirBench
Score • Self-reported
MultiPL-E
Score • Self-reported
MusicCaps
Score
Evaluation • Self-reported
MVBench
Score • Self-reported
NMOS
Score • Self-reported
OCRBench_V2
Evaluation • Self-reported
ODinW
Evaluation • Self-reported
OmniBench
Evaluation • Self-reported
OmniBench Music
Score • Self-reported
PointGrounding
Score • Self-reported
RealWorldQA
Evaluation • Self-reported
TextVQA
Evaluation • Self-reported
VideoMME w sub.
Evaluation • Self-reported
VocalSound
Score • Self-reported
VoiceBench Avg
Evaluation • Self-reported
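The Common Voice 15 and FLEURS rows above report WER; since the page only names the metric, here is a short, self-contained sketch of how WER is typically computed from the word-level edit distance. The function name and example strings are illustrative only and are not taken from any Qwen evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution against a four-word reference -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sit down"))
```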
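Similarly, the CoVoST2 en-zh row reports BLEU. The sketch below shows an illustrative sentence-level BLEU with uniform n-gram weights, add-one smoothing, and a brevity penalty; real speech-translation evaluations normally use corpus-level, tokenization-aware implementations such as sacreBLEU, so treat this only as a conceptual example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Illustrative sentence-level BLEU: geometric mean of modified n-gram
    precisions (with add-one smoothing) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))  # add-one smoothing
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Example: one substituted word against a six-word reference
print(round(bleu("the cat is on the mat", "the cat sat on the mat"), 3))
```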
License & Metadata
License
Apache 2.0
Announcement Date
March 27, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5 VL 7B Instruct
Alibaba
MM · 8.3B
Released: Jan 2025
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Price: $0.30/1M tokens
Qwen2 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Jul 2024
Gemma 3n E2B Instructed
MM · 8.0B
Best score: 0.7 (HumanEval)
Released: Jun 2025
Gemma 3n E2B
MM · 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
Gemma 3 4B
MM · 4.0B
Best score: 0.7 (HumanEval)
Released: Mar 2025
Price: $0.02/1M tokens
Gemma 3n E2B Instructed LiteRT (Preview)
MM · 1.9B
Best score: 0.7 (HumanEval)
Released: May 2025
Gemma 3n E4B Instructed
MM · 8.0B
Best score: 0.8 (HumanEval)
Released: Jun 2025
Price: $20.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.