Qwen2 72B Instruct
Qwen2-72B-Instruct is an instruction-tuned language model with 72 billion parameters, supporting a context window of up to 131,072 tokens. It is part of the new Qwen2 series, which outperforms most open-source models and demonstrates competitiveness against proprietary models across various benchmarks.
Key Specifications
Parameters
72.0B
Context
131,072 tokens
Release Date
July 23, 2024
Average Score
73.6%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
72.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
Accuracy
AI • Self-reported
MMLU
Accuracy
AI21 Lab • Self-reported
TruthfulQA
Accuracy — AI: 93.8%, Human: 93.8%. Claude 3 Opus performs the parity task with approximately the same accuracy as humans. The task requires verifying whether a binary string (for example, "10110") contains an even or an odd number of ones. Although Claude demonstrates a high level of accuracy on the task as a whole, its performance varies with input length: the model handles short inputs well, but accuracy drops as they grow. For strings of 5 to 10 characters Claude achieves 100% accuracy, matching people; however, for strings longer than 20 characters the model's accuracy falls to approximately 80%, while people maintain high accuracy even on long inputs. This suggests that the model may use a different strategy for solving the task than people do: whereas people count sequentially (+1 for each "1"), the model possibly tries to assess the whole string at once, which becomes less reliable as the string grows. • Self-reported
Winogrande
Accuracy
AI • Self-reported
Programming
Programming skills tests
HumanEval
Pass@1 — a metric that evaluates the accuracy of a model on its first attempt at solving a task. It is computed as the proportion of tasks the model solves correctly on the first try, without repeated attempts or iterations. Pass@1 is often used when evaluating the performance of language models on programming or reasoning tasks. A high Pass@1 score means the model can generate correct answers immediately, which is especially important in scenarios where there is no opportunity to retry and a correct answer is needed right away. Unlike metrics that allow several attempts (for example, Pass@k with k > 1), Pass@1 is a stricter measure, since it evaluates only the model's first answer. • Self-reported
MBPP
Pass@1 — a metric that measures the proportion of tasks the model solves correctly on the first attempt; it is computed from the number of correct first-attempt answers. Although Pass@1 is a useful metric for evaluating model performance, it does not fully characterize a model's problem-solving ability, since it does not account for the ability to succeed over several attempts or to generate alternative answers. Complementary metrics include Pass@k (solving a task within k attempts) and related multi-attempt measures, which can give a more complete picture of the model. • Self-reported
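As a rough illustration of how a first-attempt score like this is computed, here is a minimal Python sketch. The `results` list, the commented-out `run_unit_tests` and `generate_solution` helpers, and the one-sample-per-task setup are illustrative assumptions, not the harness behind the scores reported on this page.

```python
def pass_at_1(results: list[bool]) -> float:
    """Pass@1: fraction of tasks whose single generated solution passes all tests."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical usage: one generated solution per task, graded by unit tests.
# `run_unit_tests` and `generate_solution` are assumed helpers, shown only for context:
# results = [run_unit_tests(task, generate_solution(task)) for task in tasks]
results = [True, True, False, True]   # e.g. 3 of 4 tasks solved on the first try
print(f"Pass@1 = {pass_at_1(results):.1%}")  # -> Pass@1 = 75.0%
```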
Mathematics
Mathematical problems and computations
GSM8k
Accuracy
AI: 0.5, Human: 0.5 • Self-reported
MATH
Accuracy — We measure model quality through accuracy: the proportion of correct answers on the assignments. We consider an answer correct if it matches the reference. The way the check is performed depends on the task type:
- For multiple-choice tasks, we verify whether the model selected the same option as the reference.
- For tasks with a short free-form answer (for example, mathematical problems with a numerical answer), we use a normalized comparison that ignores formatting-only differences (for example, "4" against "4.0").
- For reading-comprehension tasks, an answer counts as correct when the comparison shows a match with the reference.
This metric allows a direct comparison of the effectiveness of different models, methods, or approaches on specific types of tasks. • Self-reported
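To make the "4" versus "4.0" point concrete, here is a minimal sketch of a normalized exact-match check, assuming a simple numeric/text normalization; real graders for math benchmarks typically do considerably more (LaTeX parsing, fractions, units, and so on).

```python
def normalize_answer(ans: str) -> str:
    """Light normalization so that formatting-only differences don't count as errors."""
    ans = ans.strip().lower().rstrip(".")
    try:
        # "4" and "4.0" normalize to the same value; non-numeric answers pass through.
        return str(float(ans))
    except ValueError:
        return " ".join(ans.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)

print(exact_match("4", "4.0"))   # True  -- numeric formatting ignored
print(exact_match("B", "b"))     # True  -- case-insensitive option match
print(exact_match("12", "21"))   # False -- genuinely different answers
```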
Reasoning
Logical reasoning and analysis
GPQA
Accuracy
AI • Self-reported
Other Tests
Specialized benchmarks
ARC-C
Accuracy
AI • Self-reported
BBH
Accuracy
Accuracy is one of the most common evaluation metrics for classification tasks. It is defined as the number of correct predictions made by the model divided by the total number of predictions.
Formally, accuracy = (TP + TN) / (TP + TN + FP + FN) where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively.
While accuracy is intuitive and easy to understand, it has limitations, particularly for imbalanced datasets where one class appears much more frequently than others. In such cases, a model can achieve high accuracy simply by predicting the majority class most of the time, without actually learning to distinguish between classes.
For this reason, accuracy is often complemented by other metrics such as precision, recall, F1 score, or ROC AUC, which provide more nuanced evaluations of model performance. • Self-reported
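A small self-contained Python sketch of these definitions, using a hypothetical toy dataset to show the imbalanced-class caveat mentioned above:

```python
from collections import Counter

def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Binary-classification metrics from raw labels and predictions (positive class = 1)."""
    counts = Counter(zip(y_true, y_pred))
    tp, tn = counts[(1, 1)], counts[(0, 0)]
    fp, fn = counts[(0, 1)], counts[(1, 0)]
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy data: always predicting the majority class (0) still yields 80% accuracy,
# while precision, recall, and F1 for the positive class are all zero.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```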
C-Eval
Accuracy
AI models that aim to reliably produce correct answers to specific questions can be measured according to accuracy on test benchmarks. The correctness of a model's answer is generally determined by comparison to a reference, which is often a human consensus reference with high confidence.
Benchmark accuracy is most useful when the questions have objectively correct answers, and it covers all capability dimensions that are relevant to the intended AI model use cases.
Here are a few common types of benchmarks for accuracy:
Knowledge: These benchmarks test a model's ability to recall facts correctly. Examples include TriviaQA, WebGPT Comparison, NaturalQuestions, and TruthfulQA.
STEM Reasoning: Benchmarks like MMLU, GPQA, GSM8K, MATH, and competition math like AIME assess whether a model can apply the correct reasoning to solve challenging math, science, and engineering problems.
Programming and Engineering: HumanEval, MBPP, and other code generation datasets test the model's ability to correctly complete a programming task or function.
Multilingual: Datasets like MGSM, BELEBELE, Flores, XNLI etc. help assess whether the accuracy of a model generalizes across languages.
Model developers typically report accuracy as a percentage of questions answered correctly, though some benchmarks have unique scoring methods, including partial credit for multi-step problems.
Accuracy is just one component of capability evaluation. High accuracy alone doesn't guarantee that an AI model will be helpful or safe in real-world applications. • Self-reported
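As an illustration of the "percentage of questions answered correctly" style of scoring, here is a minimal sketch for a multiple-choice benchmark; the answer-letter extraction, the `items` records, and the A-D option format are simplifying assumptions rather than any specific benchmark's official grader.

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone answer letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match.group(1) if match else None

def benchmark_accuracy(items: list[dict]) -> float:
    """Percentage of multiple-choice items answered correctly."""
    correct = sum(extract_choice(item["model_output"]) == item["answer"] for item in items)
    return 100.0 * correct / len(items)

# Hypothetical items; real benchmarks such as MMLU ship thousands of such records.
items = [
    {"model_output": "The answer is B.", "answer": "B"},
    {"model_output": "C",                "answer": "C"},
    {"model_output": "I think it's A",   "answer": "D"},
]
print(f"{benchmark_accuracy(items):.1f}% correct")  # -> 66.7% correct
```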
CMMLU
Accuracy
AI: ChatGPT-4 achieves almost perfect accuracy on elementary school level arithmetic problems. However, accuracy falls off dramatically when tackling upper level problems. While our model remains competitive with the state of the art, achieving high accuracy on advanced problems remains a significant challenge. • Self-reported
EvalPlus
Pass@1 — a metric for solving tasks on the first attempt. It shows what percentage of tasks the model can solve on the first try, without repeated attempts or iterative refinement. A high Pass@1 value is especially important in contexts where retries are costly or impossible and where an exact solution is required the first time, for example in real-time systems or production applications. Unlike metrics that allow multiple attempts, Pass@1 evaluates exclusively the model's ability to produce a correct solution immediately, which serves as a measure of its understanding of and reliability on the task. • Self-reported
MMLU-Pro
Accuracy
AI models can vary in how accurate they are—that is, whether they produce correct answers to questions. Measuring accuracy is one of the most common model evaluation methods because it corresponds to our intuitive notion of model capabilities. Examples of evaluations focused on accuracy include answering multiple-choice questions on standardized tests, responding to trivia questions (e.g., TriviaQA), and computing answers to math problems (e.g., MATH, GSM8K).
Metrics are highly task-dependent. For multiple-choice questions, a common choice is accuracy (i.e., the percentage of questions answered correctly). For other types of questions, metrics can include exact match, precision, recall, F1 score, and others, along with human assessments. • Self-reported
MultiPL-E
Pass@1 — this metric measures the probability that the model produces a correct answer on its first attempt, computed as the percentage of tasks the model solves correctly on the first try. In many computational tasks, especially programming, a model can be given several attempts to reach a correct solution; Pass@1 instead evaluates the model's ability to obtain a correct solution on the first attempt, without retries. To compute Pass@1, the model generates one solution for each task, and each solution is graded as correct or incorrect; the fraction of correct solutions is the Pass@1 score. A high Pass@1 indicates that the model can generate correct answers without needing several attempts, which reflects a deeper understanding of the task and higher reliability. • Self-reported
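Pass@1 is the k = 1 case of the more general pass@k family. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), assuming n samples are drawn per task and c of them pass; the sampling configuration actually used for the scores on this page is not stated here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples per task, 5 of which pass, the probability that at least one of
# k randomly chosen samples passes:
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=20, c=5, k=k):.3f}")
# pass@1 = 0.250, pass@5 = 0.806, pass@10 = 0.984
```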
TheoremQA
Accuracy — We evaluate models by their ability to correctly solve GPQA tasks; this evaluation is similar to standard LLM benchmark evaluation. We also assess models on other aspects of the questions and on the quality of their answers, and we compare errors from one model to another. Not all errors are alike: some types of errors occur more with one model than with another. Understanding these patterns can give a picture of the strengths and weaknesses of the models, as well as of how effectively improvements to a model address various types of errors. • Self-reported
License & Metadata
License
tongyi_qianwen
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024
QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025
QwQ-32B-Preview
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Nov 2024
Price: $1.20/1M tokens
Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens
Qwen3 30B A3B
Alibaba
30.5B
Best score: 0.7 (GPQA)
Released: Apr 2025
Price: $0.10/1M tokens
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Qwen3.5 35B A3B
Alibaba
35.0B
Released: Mar 2026
Qwen3-Next-80B-A3B-Instruct
Alibaba
80.0B
Released: Sep 2025
Price: $0.15/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.