
Qwen2.5 7B Instruct

Alibaba

Qwen2.5-7B-Instruct is a 7-billion-parameter instruction-tuned language model that excels at instruction following, long text generation (over 8,000 tokens), understanding structured data, and producing structured outputs such as JSON. The model features improved math and coding capabilities and supports over 29 languages, including Chinese, English, French, and Spanish.
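As a rough illustration of the instruction-following and JSON-output use case, here is a minimal sketch of running the model locally with Hugging Face transformers. The model ID and chat-template calls follow the standard transformers API; the prompt and generation settings are illustrative assumptions, not an official recipe.

```python
# Minimal sketch: local inference with Hugging Face transformers.
# The prompt below is an invented example of structured (JSON) output.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Reply in JSON."},
    {"role": "user", "content": 'Extract {"name", "year"} from: '
                                '"Qwen2.5 was released in 2024 by Alibaba."'},
]
# apply_chat_template wraps the conversation in the model's chat markup
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# strip the prompt tokens before decoding the completion
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```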

Key Specifications

Parameters
7.6B
Context
131.1K
Release Date
September 19, 2024
Average Score
65.6%

Timeline

Key dates in the model's history
Announcement
September 19, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
7.6B
Training Tokens
18.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.30
Output (per 1M tokens)
$0.30
Max Input Tokens
131.1K
Max Output Tokens
8.2K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
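A quick back-of-the-envelope cost check using the listed rates ($0.30 per 1M tokens for both input and output). The token counts below are made-up example values, and real billing may differ by provider:

```python
# Hypothetical cost estimate from the listed per-token rates.
INPUT_RATE = 0.30 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.30 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    assert input_tokens <= 131_100, "exceeds the 131.1K max input tokens"
    assert output_tokens <= 8_200, "exceeds the 8.2K max output tokens"
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g., a 10,000-token prompt with a 1,000-token completion:
print(f"${request_cost(10_000, 1_000):.4f}")  # $0.0033
```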

Benchmark Results

Model performance metrics across various tests and benchmarks

Programming

Programming skills tests
HumanEval
HumanEval measures functional correctness of model-generated Python code: the model completes functions from signatures and docstrings, and solutions are scored by unit-test pass rate.
Self-reported
84.8%
MBPP
MBPP (Mostly Basic Python Problems) tests generation of short Python programs from natural-language task descriptions, with each solution checked against supplied test cases.
Self-reported
79.2%

Mathematics

Mathematical problems and computations
GSM8k
GSM8K is a benchmark of grade-school math word problems that require multi-step arithmetic reasoning; answers are checked against reference solutions. A typical item is sketched after this entry.
Self-reported
91.6%
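For flavor, a typical GSM8K-style item chains a few arithmetic steps. The problem below is an invented example in that style, not taken from the benchmark:

```python
# Invented GSM8K-style word problem:
# "A bakery sells muffins for $3 each. On Monday it sells 24 muffins,
#  and on Tuesday it sells half as many. How much revenue in total?"
monday = 24
tuesday = monday // 2               # half as many as Monday
revenue = (monday + tuesday) * 3    # 36 muffins at $3 each
print(revenue)                      # 108
```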
MATH
MATH (Hendrycks et al., 2021) is a benchmark of competition-style mathematics problems requiring multi-step solutions. Answers follow a fixed format (numbers or mathematical expressions), so grading is automatic and needs no judge model.
Self-reported
75.5%

Reasoning

Logical reasoning and analysis
GPQA
GPQA (Graduate-Level Google-Proof Q&A) is a set of graduate-level multiple-choice questions in biology, physics, and chemistry, written so that even skilled non-experts with web access struggle to answer them. Solving them requires domain expertise and careful multi-step reasoning.
Self-reported
36.4%

Other Tests

Specialized benchmarks
AlignBench
AlignBench (v1.1) is a multi-dimensional benchmark for evaluating how well a model's Chinese-language responses align with user intent, scored with an LLM-as-judge methodology across categories such as reasoning, language understanding, and task completion.
Self-reported
73.3%
Arena Hard
Arena-Hard is a benchmark of challenging user prompts drawn from real Chatbot Arena conversations; model responses are compared pairwise against a baseline model by a strong LLM judge, and the score is the resulting win rate.
Self-reported
52.0%
IFEval
IFEval (strict-prompt) measures instruction following using prompts with verifiable constraints (e.g., word counts or required formats) that can be checked programmatically, without a judge model.
Self-reported
71.2%
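To make the "strict" scoring concrete: IFEval-style checks are plain programmatic predicates over the response text. The sketch below implements one hypothetical constraint (a word limit); it is illustrative, not the benchmark's actual code:

```python
# Illustrative IFEval-style verifiable check: "answer in at most 50 words".
def follows_word_limit(response: str, max_words: int = 50) -> bool:
    """Strict check: the response must contain at most max_words words."""
    return len(response.split()) <= max_words

print(follows_word_limit("A short answer.", max_words=50))  # True
```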
LiveBench
LiveBench (0831 release) is a contamination-resistant benchmark whose questions are refreshed regularly, spanning reasoning, coding, mathematics, data analysis, and instruction-following tasks.
Self-reported
35.9%
LiveCodeBench
LiveCodeBench (2305-2409) evaluates code generation on competitive-programming problems published between May 2023 and September 2024; the rolling collection window limits training-data contamination.
Self-reported
28.7%
MMLU-Pro
MMLU-Pro extends MMLU with harder, more reasoning-heavy questions and ten answer options per question instead of four, reducing the score obtainable by guessing.
Self-reported
56.3%
MMLU-Redux
MMLU-Redux is a re-annotated subset of MMLU in which mislabeled, ambiguous, or broken questions have been corrected or removed, giving a cleaner measure of knowledge.
Self-reported
75.4%
MT-Bench
MT-Bench is an LMSYS benchmark of 80 multi-turn questions covering writing, roleplay, reasoning, mathematics, coding, extraction, STEM, and humanities. GPT-4 scores each answer on a 1-10 scale, and the reported result is the average; the percentage shown here appears to be that average rescaled (i.e., 8.75/10).
Self-reported
87.5%
MultiPL-E
MultiPL-E translates HumanEval and MBPP problems into 18+ programming languages to benchmark code generation beyond Python; solutions are scored with the pass@k metric.
Self-reported
70.4%
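The pass@k metric used by HumanEval-style suites such as MultiPL-E has a standard unbiased estimator (Chen et al., 2021): generate n samples per problem, count the c that pass, and estimate pass@k as 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021), as used by
# HumanEval-style benchmarks: n samples per problem, c of which pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k draws from n samples, c correct, passes)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 20 samples per problem, 12 correct: pass@1 estimate
print(pass_at_k(20, 12, 1))  # 0.6
```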

License & Metadata

License
Apache 2.0
Announcement Date
September 19, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.