
Qwen2.5 7B Instruct

Alibaba

Qwen2.5-7B-Instruct is a 7-billion-parameter instruction-tuned language model that excels at instruction following, long text generation (over 8,000 tokens), understanding structured data, and producing structured outputs such as JSON. The model features improved math and coding capabilities and supports over 29 languages, including Chinese, English, French, and Spanish.
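As a rough illustration of the instruction-following and JSON-output use case, here is a minimal sketch of running the model locally with Hugging Face transformers. The model ID and chat-template calls follow the standard transformers API; the prompt and generation settings are illustrative assumptions, not an official recipe.

```python
# Minimal sketch: local inference with Hugging Face transformers.
# The prompt below is an invented example of structured (JSON) output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Reply in JSON."},
    {"role": "user", "content": 'Extract {"name", "year"} from: '
                                '"Qwen2.5 was released in 2024 by Alibaba."'},
]
# apply_chat_template wraps the conversation in the model's chat markup
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# strip the prompt tokens before decoding the completion
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```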

Key Specifications

Parameters
7.6B
Context
131.1K
Release Date
September 19, 2024
Average Score
65.6%

Timeline

Key dates in the model's history
Announcement
September 19, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
7.6B
Training Tokens
18.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.30
Output (per 1M tokens)
$0.30
Max Input Tokens
131.1K
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
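A quick back-of-the-envelope cost check using the listed rates ($0.30 per 1M tokens for both input and output). The token counts below are made-up example values, and real billing may differ by provider:

```python
# Hypothetical cost estimate from the listed per-token rates.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.30 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    assert input_tokens <= 131_100, "exceeds the 131.1K max input tokens"
    assert output_tokens <= 8_200, "exceeds the 8.2K max output tokens"
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g., a 10,000-token prompt with a 1,000-token completion:
print(f"${request_cost(10_000, 1_000):.4f}")  # $0.0033
```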

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
HumanEval measures functional correctness of model-generated Python code: the model completes functions from signatures and docstrings, and solutions are scored by unit-test pass rate.
Self-reported
84.8%
MBPP
MBPP (Mostly Basic Python Problems) tests generation of short Python programs from natural-language task descriptions, with each solution checked against supplied test cases.
Self-reported
79.2%

Mathematics

Mathematical problems and computations
GSM8k
GSM8K is a benchmark of grade-school math word problems that require multi-step arithmetic reasoning; answers are checked against reference solutions. A typical item is sketched after this entry.
Self-reported
91.6%
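For flavor, a typical GSM8K-style item chains a few arithmetic steps. The problem below is an invented example in that style, not taken from the benchmark:

```python
# Invented GSM8K-style word problem:
# "A bakery sells muffins for $3 each. On Monday it sells 24 muffins,
#  and on Tuesday it sells half as many. How much revenue in total?"
monday = 24
tuesday = monday // 2               # half as many as Monday
revenue = (monday + tuesday) * 3    # 36 muffins at $3 each
print(revenue)                      # 108
```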
MATH
MATH (Hendrycks et al., 2021) is a benchmark of competition-style mathematics problems requiring multi-step solutions. Answers follow a fixed format (numbers or mathematical expressions), so grading is automatic and needs no judge model.
Self-reported
75.5%

Reasoning

Logical reasoning and analysis
GPQA
GPQA (Graduate-Level Google-Proof Q&A) is a set of graduate-level multiple-choice questions in biology, physics, and chemistry, written so that even skilled non-experts with web access struggle to answer them. Solving them requires domain expertise and careful multi-step reasoning.
Self-reported
36.4%

Other Tests

Specialized benchmarks
AlignBench
AlignBench (v1.1) is a multi-dimensional benchmark for evaluating how well a model's Chinese-language responses align with user intent, scored with an LLM-as-judge methodology across categories such as reasoning, language understanding, and task completion.
Self-reported
73.3%
Arena Hard
Arena-Hard is a benchmark of challenging user prompts drawn from real Chatbot Arena conversations; model responses are compared pairwise against a baseline model by a strong LLM judge, and the score is the resulting win rate.
Self-reported
52.0%
IFEval
IFEval (strict-prompt) measures instruction following using prompts with verifiable constraints (e.g., word counts or required formats) that can be checked programmatically, without a judge model.
Self-reported
71.2%
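To make the "strict" scoring concrete: IFEval-style checks are plain programmatic predicates over the response text. The sketch below implements one hypothetical constraint (a word limit); it is illustrative, not the benchmark's actual code:

```python
# Illustrative IFEval-style verifiable check: "answer in at most 50 words".
def follows_word_limit(response: str, max_words: int = 50) -> bool:
    """Strict check: the response must contain at most max_words words."""
    return len(response.split()) <= max_words

print(follows_word_limit("A short answer.", max_words=50))  # True
```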
LiveBench
LiveBench (0831 release) is a contamination-resistant benchmark whose questions are refreshed regularly, spanning reasoning, coding, mathematics, data analysis, and instruction-following tasks.
Self-reported
35.9%
LiveCodeBench
LiveCodeBench (2305-2409) evaluates code generation on competitive-programming problems published between May 2023 and September 2024; the rolling collection window limits training-data contamination.
Self-reported
28.7%
MMLU-Pro
MMLU-Pro extends MMLU with harder, more reasoning-heavy questions and ten answer options per question instead of four, reducing the score obtainable by guessing.
Self-reported
56.3%
MMLU-Redux
MMLU-Redux is a re-annotated subset of MMLU in which mislabeled, ambiguous, or broken questions have been corrected or removed, giving a cleaner measure of knowledge.
Self-reported
75.4%
MT-Bench
MT-Bench is an LMSYS benchmark of 80 multi-turn questions covering writing, roleplay, reasoning, mathematics, coding, extraction, STEM, and humanities. GPT-4 scores each answer on a 1-10 scale, and the reported result is the average; the percentage shown here appears to be that average rescaled (i.e., 8.75/10).
Self-reported
87.5%
MultiPL-E
MultiPL-E translates HumanEval and MBPP problems into 18+ programming languages to benchmark code generation beyond Python; solutions are scored with the pass@k metric.
Self-reported
70.4%
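The pass@k metric used by HumanEval-style suites such as MultiPL-E has a standard unbiased estimator (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate pass@k as 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021), as used by
# HumanEval-style benchmarks: n samples per problem, c of which pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k draws from n samples, c correct, passes)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 20 samples per problem, 12 correct: pass@1 estimate
print(pass_at_k(20, 12, 1))  # 0.6
```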

License & Metadata

License
Apache 2.0
Announcement Date
September 19, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.