Qwen2.5-Omni-7B
Multimodal
Qwen2.5-Omni is the flagship end-to-end multimodal model in the Qwen series. It processes diverse inputs including text, images, audio, and video, and provides real-time streaming responses through text generation and natural speech synthesis, built on the novel Thinker-Talker architecture.
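For orientation, here is a minimal inference sketch against the publicly released checkpoint. It assumes the Hugging Face repo `Qwen/Qwen2.5-Omni-7B`, the `Qwen2_5OmniForConditionalGeneration` / `Qwen2_5OmniProcessor` classes shipped in recent `transformers` releases, and the `qwen_omni_utils` helper package referenced in the model card; exact class and helper names can differ between versions, and the video URL is a placeholder, so treat this as a sketch rather than the canonical recipe.

```python
# Minimal sketch: text + speech generation with Qwen2.5-Omni.
# Class and helper names follow recent transformers releases and the Qwen model card;
# verify them against your installed version before relying on this snippet.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper package from the Qwen model card

model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# A multimodal chat turn: the "Thinker" produces text, the "Talker" synthesizes speech.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "https://example.com/clip.mp4"},  # placeholder URL
        {"type": "text", "text": "Describe what happens in this clip."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# generate() returns token ids plus a synthesized waveform when speech output is enabled.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```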
Key Specifications
Parameters
7.0B
Context
-
Release Date
March 27, 2025
Average Score
59.2%
Timeline
Key dates in the model's history
Announcement
March 27, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
7.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
Programming
Programming skills tests
HumanEval
Evaluation • Self-reported
MBPP
Score
Evaluation • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Score
Evaluation • Self-reported
MATH
Score
Evaluation • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
Evaluation • Self-reported
Multimodal
Working with images and visual data
AI2D
Evaluation • Self-reported
ChartQA
Score • Self-reported
DocVQA
Score • Self-reported
MathVista
Score • Self-reported
MMMU
Evaluation • Self-reported
Other Tests
Specialized benchmarks
Common Voice 15
WER (Word Error Rate) measures the discrepancy between a recognized transcript and the reference text. It is computed as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the total number of words in the reference; lower is better, and 0% means an exact match. A worked computation sketch follows the benchmark list. • Self-reported
CoVoST2 en-zh
BLEU (Bilingual Evaluation Understudy) is a metric for machine-translation quality. It measures n-gram overlap between a candidate translation and one or more references, combined with a brevity penalty; scores range from 0 to 1 (often scaled to 0-100), with higher values indicating closer agreement. It is the standard metric for comparing translation systems, although it does not capture synonymy or meaning-level equivalence. A minimal computation sketch follows the benchmark list. • Self-reported
CRPE relation
Score • Self-reported
EgoSchema
Score • Self-reported
FLEURS
WER • Self-reported
GiantSteps Tempo
Evaluation • Self-reported
LiveBench
Score • Self-reported
MathVision
Score • Self-reported
MELD
Evaluation • Self-reported
MMAU
Score
Evaluation • Self-reported
MMAU Music
Evaluation • Self-reported
MMAU Sound
Score • Self-reported
MMAU Speech
Score • Self-reported
MMBench-V1.1
Score • Self-reported
MME-RealWorld
Evaluation • Self-reported
MMLU-Pro
Score • Self-reported
MMLU-Redux
Evaluation • Self-reported
MM-MT-Bench
Evaluation • Self-reported
MMMU-Pro
Evaluation • Self-reported
MMStar
Evaluation • Self-reported
MuirBench
Score • Self-reported
MultiPL-E
Score • Self-reported
MusicCaps
Score
Evaluation • Self-reported
MVBench
Score • Self-reported
NMOS
Score • Self-reported
OCRBench_V2
Evaluation • Self-reported
ODinW
Evaluation • Self-reported
OmniBench
Evaluation • Self-reported
OmniBench Music
Score • Self-reported
PointGrounding
Score • Self-reported
RealWorldQA
Evaluation • Self-reported
TextVQA
Evaluation • Self-reported
VideoMME w sub.
Evaluation • Self-reported
VocalSound
Score • Self-reported
VoiceBench Avg
Evaluation • Self-reported
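The Common Voice 15 and FLEURS rows above report WER; since the page only names the metric, here is a short, self-contained sketch of how WER is typically computed from the word-level edit distance. The function name and example strings are illustrative only and are not taken from any Qwen evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution against a four-word reference -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sit down"))
```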
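Similarly, the CoVoST2 en-zh row reports BLEU. The sketch below shows an illustrative sentence-level BLEU with uniform n-gram weights, add-one smoothing, and a brevity penalty; real speech-translation evaluations normally use corpus-level, tokenization-aware implementations such as sacreBLEU, so treat this only as a conceptual example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Illustrative sentence-level BLEU: geometric mean of modified n-gram
    precisions (with add-one smoothing) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))  # add-one smoothing
    brevity_penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Example: one substituted word against a six-word reference
print(round(bleu("the cat is on the mat", "the cat sat on the mat"), 3))
```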
License & Metadata
License
Apache 2.0
Announcement Date
March 27, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5 VL 7B Instruct
Alibaba
MM · 8.3B
Released: Jan 2025
Qwen2.5 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Price: $0.30/1M tokens
Qwen2 7B Instruct
Alibaba
7.6B
Best score: 0.8 (HumanEval)
Released: Jul 2024
Gemma 3n E2B Instructed
MM · 8.0B
Best score: 0.7 (HumanEval)
Released: Jun 2025
Gemma 3n E2B
MM · 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
Gemma 3 4B
MM · 4.0B
Best score: 0.7 (HumanEval)
Released: Mar 2025
Price: $0.02/1M tokens
Gemma 3n E2B Instructed LiteRT (Preview)
MM · 1.9B
Best score: 0.7 (HumanEval)
Released: May 2025
Gemma 3n E4B Instructed
MM · 8.0B
Best score: 0.8 (HumanEval)
Released: Jun 2025
Price: $20.00/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.