
Qwen2 7B Instruct

Alibaba

Qwen2-7B-Instruct is an instruction-tuned language model with 7 billion parameters, supporting a context window of up to 131,072 tokens.
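A minimal usage sketch with Hugging Face transformers is shown below. The model ID "Qwen/Qwen2-7B-Instruct" and the chat-template API follow the library's standard conventions and are not taken from this page; treat them as assumptions.

```python
# Minimal sketch: loading and querying Qwen2-7B-Instruct with transformers.
# The model ID and chat-template usage are assumptions, not from this page.
# device_map="auto" additionally requires the `accelerate` package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize pass@1 in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```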

Key Specifications

Parameters
7.6B
Context
131,072 tokens
Release Date
July 23, 2024
Average Score
59.5%

Timeline

Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025

Technical Specifications

Parameters
7.6B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Accuracy (self-reported)
70.5%

Programming

Programming skills tests
HumanEval
Pass@1 measures the probability that the model's first generated solution is correct. Unlike Pass@k, which gives the model k attempts, Pass@1 allows only one. A high Pass@1 score means the model can produce correct solutions without retries, which matters for real applications where users typically rely on the first answer and cannot verify several options. Correctness is checked automatically, for example by executing the generated code or comparing against reference answers. (Self-reported; a minimal evaluation sketch follows this subsection.)
79.9%
MBPP
Pass@1: the percentage of tasks the model solves on its first attempt. The metric is especially useful for judging whether a model can complete tasks reliably without repeated attempts or iteration, and it reflects real-world single-shot performance more directly than Pass@k for k > 1. (Self-reported)
67.2%
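The sketch below illustrates how a Pass@1 check like the one described above can be run for code benchmarks: the model gets exactly one attempt per task, and the attempt counts as solved only if the task's unit tests all pass. `generate_solution` and the task format are hypothetical stand-ins, not the actual HumanEval/MBPP harness.

```python
# Illustrative Pass@1 evaluation loop; one attempt per task, scored by tests.

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run the candidate solution together with its unit tests."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)      # assert-based tests raise on failure
        return True
    except Exception:
        return False

def pass_at_1(tasks: list[dict], generate_solution) -> float:
    """Fraction of tasks solved by the model's single first attempt."""
    solved = sum(
        passes_tests(generate_solution(task["prompt"]), task["tests"])
        for task in tasks
    )
    return solved / len(tasks)
```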

Mathematics

Mathematical problems and computations
GSM8k
Accuracy (self-reported)
82.3%
MATH
Accuracy (self-reported)
49.6%

Reasoning

Logical reasoning and analysis
GPQA
Accuracy (self-reported)
25.3%

Other Tests

Specialized benchmarks
AlignBench
LLM-judged evaluation: each solution is scored from 0 to 10, where 0 is a completely incorrect solution and 10 a fully correct one. The judge assesses not only the final answer but also the method and justification, noting any errors and possible improvements. (Self-reported)
72.1%
C-Eval
Accuracy (self-reported)
77.2%
EvalPlus
Pass@1: the share of code-generation problems solved by the model's first attempt. In practice it is often estimated by sampling n solutions per task, counting the c that are correct, and applying the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), which for k = 1 reduces to c / n. (Self-reported; a sketch of this estimator appears at the end of this section.)
70.3%
LiveCodeBench
Pass@1 (self-reported)
26.6%
MMLU-Pro
Accuracy (self-reported)
44.1%
MT-Bench
LLM-judged evaluation with GPT-4 as the judge (self-reported)
84.1%
MultiPL-E
Pass@1: the percentage of test cases the model solves on its first attempt; higher is better. Unlike methods that sample several solutions in parallel and pick the most consistent one (self-consistency), or that try multiple prompt variants, Pass@1 measures the model's ability to produce a correct answer immediately, without repeated queries or extra computation. (Self-reported)
59.1%
TheoremQA
Accuracy (self-reported)
25.3%
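The EvalPlus row above references the unbiased pass@k estimator. A minimal sketch of the standard formulation follows; the numbers in the usage example are illustrative, not from this page.

```python
# Standard unbiased pass@k estimator: with n sampled solutions per task, of
# which c are correct, pass@k = 1 - C(n-c, k) / C(n, k); for k = 1 this
# reduces to c / n.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task estimate of pass@k from n samples with c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every k-subset hits a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers (not from this page): 10 samples, 3 correct.
print(pass_at_k(10, 3, 1))  # 0.3 == c / n
print(pass_at_k(10, 3, 5))  # chance a random 5-sample subset contains a correct one
```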

License & Metadata

License
Apache 2.0
Announcement Date
July 23, 2024
Last Updated
July 19, 2025

Similar Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.