
Qwen2 7B Instruct

Alibaba

Qwen2-7B-Instruct is an instruction-tuned language model with 7 billion parameters, supporting a context window of up to 131,072 tokens.
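A minimal usage sketch with Hugging Face transformers is shown below. The model ID "Qwen/Qwen2-7B-Instruct" and the chat-template API follow the library's standard conventions and are not taken from this page; treat them as assumptions.

```python
# Minimal sketch: loading and querying Qwen2-7B-Instruct with transformers.
# The model ID and chat-template usage are assumptions, not from this page.
# device_map="auto" additionally requires the `accelerate` package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize pass@1 in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```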

Key Specifications

Parameters
7.6B
Context
131,072 tokens
Release Date
July 23, 2024
Average Score
59.5%

Timeline

Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
7.6B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy (self-reported)
70.5%

Programming

Programming skills tests
HumanEval
Pass@1 measures the probability that the model's first generated solution is correct. Unlike Pass@k, which gives the model k attempts, Pass@1 allows only one. A high Pass@1 score means the model can produce correct solutions without retries, which matters for real applications where users typically rely on the first answer and cannot verify several options. Correctness is checked automatically, for example by executing the generated code or comparing against reference answers. (Self-reported; a minimal evaluation sketch follows this subsection.)
79.9%
MBPP
Pass@1: the percentage of tasks the model solves on its first attempt. The metric is especially useful for judging whether a model can complete tasks reliably without repeated attempts or iteration, and it reflects real-world single-shot performance more directly than Pass@k for k > 1. (Self-reported)
67.2%
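The sketch below illustrates how a Pass@1 check like the one described above can be run for code benchmarks: the model gets exactly one attempt per task, and the attempt counts as solved only if the task's unit tests all pass. `generate_solution` and the task format are hypothetical stand-ins, not the actual HumanEval/MBPP harness.

```python
# Illustrative Pass@1 evaluation loop; one attempt per task, scored by tests.

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate solution together with its unit tests."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)      # assert-based tests raise on failure
        return True
    except Exception:
        return False

def pass_at_1(tasks: list[dict], generate_solution) -> float:
    """Fraction of tasks solved by the model's single first attempt."""
    solved = sum(
        passes_tests(generate_solution(task["prompt"]), task["tests"])
        for task in tasks
    )
    return solved / len(tasks)
```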

Mathematics

Mathematical problems and computations
GSM8k
Accuracy (self-reported)
82.3%
MATH
Accuracy (self-reported)
49.6%

Reasoning

Logical reasoning and analysis
GPQA
Accuracy (self-reported)
25.3%

Other Tests

Specialized benchmarks
AlignBench
LLM-judged evaluation: each solution is scored from 0 to 10, where 0 is a completely incorrect solution and 10 a fully correct one. The judge assesses not only the final answer but also the method and justification, noting any errors and possible improvements. (Self-reported)
72.1%
C-Eval
Accuracy (self-reported)
77.2%
EvalPlus
Pass@1: the share of code-generation problems solved by the model's first attempt. In practice it is often estimated by sampling n solutions per task, counting the c that are correct, and applying the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), which for k = 1 reduces to c / n. (Self-reported; a sketch of this estimator appears at the end of this section.)
70.3%
LiveCodeBench
Pass@1 (self-reported)
26.6%
MMLU-Pro
Accuracy (self-reported)
44.1%
MT-Bench
LLM-judged evaluation with GPT-4 as the judge (self-reported)
84.1%
MultiPL-E
Pass@1: the percentage of test cases the model solves on its first attempt; higher is better. Unlike methods that sample several solutions in parallel and pick the most consistent one (self-consistency), or that try multiple prompt variants, Pass@1 measures the model's ability to produce a correct answer immediately, without repeated queries or extra computation. (Self-reported)
59.1%
TheoremQA
Accuracy (self-reported)
25.3%
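The EvalPlus row above references the unbiased pass@k estimator. A minimal sketch of the standard formulation follows; the numbers in the usage example are illustrative, not from this page.

```python
# Standard unbiased pass@k estimator: with n sampled solutions per task, of
# which c are correct, pass@k = 1 - C(n-c, k) / C(n, k); for k = 1 this
# reduces to c / n.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimate of pass@k from n samples with c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset hits a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers (not from this page): 10 samples, 3 correct.
print(pass_at_k(10, 3, 1))  # 0.3 == c / n
print(pass_at_k(10, 3, 5))  # chance a random 5-sample subset contains a correct one
```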

License & Metadata

License
Apache 2.0
Announcement Date
July 23, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.