Qwen2 72B Instruct
Qwen2-72B-Instruct is an instruction-tuned language model with 72 billion parameters, supporting a context window of up to 131,072 tokens. It is part of the new Qwen2 series, which outperforms most open-source models and demonstrates competitiveness against proprietary models across various benchmarks.
Key Specifications
Parameters
72.0B
Context
131,072 tokens
Release Date
July 23, 2024
Average Score
73.6%
Timeline
Key dates in the model's history
Announcement
July 23, 2024
Last Update
July 19, 2025
Today
March 25, 2026
Technical Specifications
Parameters
72.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal • ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
Accuracy
AI • Self-reported
MMLU
Accuracy
AI21 Lab • Self-reported
TruthfulQA
Accuracy — AI: 93.8%, Human: 93.8%. Claude 3 Opus performs the parity task with approximately the same accuracy as humans. The task requires verifying whether a binary string (for example, "10110") contains an even or an odd number of ones. Although Claude demonstrates a high level of accuracy on the task as a whole, its performance varies with input length: the model handles short inputs well, but accuracy drops as they grow. For strings of 5 to 10 characters Claude achieves 100% accuracy, matching people; however, for strings longer than 20 characters the model's accuracy falls to approximately 80%, while people maintain high accuracy even on long inputs. This suggests that the model may use a different strategy for solving the task than people do: whereas people count sequentially (+1 for each "1"), the model possibly tries to assess the whole string at once, which becomes less reliable as the string grows. • Self-reported
Winogrande
Accuracy
AI • Self-reported
Programming
Programming skills tests
HumanEval
Pass@1 — a metric that evaluates the accuracy of a model on its first attempt at solving a task. It is computed as the proportion of tasks the model solves correctly on the first try, without repeated attempts or iterations. Pass@1 is often used when evaluating the performance of language models on programming or reasoning tasks. A high Pass@1 score means the model can generate correct answers immediately, which is especially important in scenarios where there is no opportunity to retry and a correct answer is needed right away. Unlike metrics that allow several attempts (for example, Pass@k with k > 1), Pass@1 is a stricter measure, since it evaluates only the model's first answer. • Self-reported
MBPP
Pass@1 — a metric that measures the proportion of tasks the model solves correctly on the first attempt; it is computed from the number of correct first-attempt answers. Although Pass@1 is a useful metric for evaluating model performance, it does not fully characterize a model's problem-solving ability, since it does not account for the ability to succeed over several attempts or to generate alternative answers. Complementary metrics include Pass@k (solving a task within k attempts) and related multi-attempt measures, which can give a more complete picture of the model. • Self-reported
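As a rough illustration of how a first-attempt score like this is computed, here is a minimal Python sketch. The `results` list, the commented-out `run_unit_tests` and `generate_solution` helpers, and the one-sample-per-task setup are illustrative assumptions, not the harness behind the scores reported on this page.

```python
def pass_at_1(results: list[bool]) -> float:
    """Pass@1: fraction of tasks whose single generated solution passes all tests."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical usage: one generated solution per task, graded by unit tests.
# `run_unit_tests` and `generate_solution` are assumed helpers, shown only for context:
# results = [run_unit_tests(task, generate_solution(task)) for task in tasks]
results = [True, True, False, True]   # e.g. 3 of 4 tasks solved on the first try
print(f"Pass@1 = {pass_at_1(results):.1%}")  # -> Pass@1 = 75.0%
```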
Mathematics
Mathematical problems and computations
GSM8k
Accuracy
AI: 0.5, Human: 0.5 • Self-reported
MATH
Accuracy — We measure model quality through accuracy: the proportion of correct answers on the assignments. We consider an answer correct if it matches the reference. The way the check is performed depends on the task type:
- For multiple-choice tasks, we verify whether the model selected the same option as the reference.
- For tasks with a short free-form answer (for example, mathematical problems with a numerical answer), we use a normalized comparison that ignores formatting-only differences (for example, "4" against "4.0").
- For reading-comprehension tasks, an answer counts as correct when the comparison shows a match with the reference.
This metric allows a direct comparison of the effectiveness of different models, methods, or approaches on specific types of tasks. • Self-reported
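To make the "4" versus "4.0" point concrete, here is a minimal sketch of a normalized exact-match check, assuming a simple numeric/text normalization; real graders for math benchmarks typically do considerably more (LaTeX parsing, fractions, units, and so on).

```python
def normalize_answer(ans: str) -> str:
    """Light normalization so that formatting-only differences don't count as errors."""
    ans = ans.strip().lower().rstrip(".")
    try:
        # "4" and "4.0" normalize to the same value; non-numeric answers pass through.
        return str(float(ans))
    except ValueError:
        return " ".join(ans.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)

print(exact_match("4", "4.0"))   # True  -- numeric formatting ignored
print(exact_match("B", "b"))     # True  -- case-insensitive option match
print(exact_match("12", "21"))   # False -- genuinely different answers
```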
Reasoning
Logical reasoning and analysis
GPQA
Accuracy
AI • Self-reported
Other Tests
Specialized benchmarks
ARC-C
Accuracy
AI • Self-reported
BBH
Accuracy
Accuracy is one of the most common evaluation metrics for classification tasks. It is defined as the number of correct predictions made by the model divided by the total number of predictions.
Formally, accuracy = (TP + TN) / (TP + TN + FP + FN) where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively.
While accuracy is intuitive and easy to understand, it has limitations, particularly for imbalanced datasets where one class appears much more frequently than others. In such cases, a model can achieve high accuracy simply by predicting the majority class most of the time, without actually learning to distinguish between classes.
For this reason, accuracy is often complemented by other metrics such as precision, recall, F1 score, or ROC AUC, which provide more nuanced evaluations of model performance. • Self-reported
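A small self-contained Python sketch of these definitions, using a hypothetical toy dataset to show the imbalanced-class caveat mentioned above:

```python
from collections import Counter

def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Binary-classification metrics from raw labels and predictions (positive class = 1)."""
    counts = Counter(zip(y_true, y_pred))
    tp, tn = counts[(1, 1)], counts[(0, 0)]
    fp, fn = counts[(0, 1)], counts[(1, 0)]
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy data: always predicting the majority class (0) still yields 80% accuracy,
# while precision, recall, and F1 for the positive class are all zero.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```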
C-Eval
Accuracy
AI models that aim to reliably produce correct answers to specific questions can be measured according to accuracy on test benchmarks. The correctness of a model's answer is generally determined by comparison to a reference, which is often a human consensus reference with high confidence.
Benchmark accuracy is most useful when the questions have objectively correct answers, and it covers all capability dimensions that are relevant to the intended AI model use cases.
Here are a few common types of benchmarks for accuracy:
Knowledge: These benchmarks test a model's ability to recall facts correctly. Examples include TriviaQA, WebGPT Comparison, NaturalQuestions, and TruthfulQA.
STEM Reasoning: Benchmarks like MMLU, GPQA, GSM8K, MATH, and competition math like AIME assess whether a model can apply the correct reasoning to solve challenging math, science, and engineering problems.
Programming and Engineering: HumanEval, MBPP, and other code generation datasets test the model's ability to correctly complete a programming task or function.
Multilingual: Datasets like MGSM, BELEBELE, Flores, XNLI etc. help assess whether the accuracy of a model generalizes across languages.
Model developers typically report accuracy as a percentage of questions answered correctly, though some benchmarks have unique scoring methods, including partial credit for multi-step problems.
Accuracy is just one component of capability evaluation. High accuracy alone doesn't guarantee that an AI model will be helpful or safe in real-world applications. • Self-reported
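As an illustration of the "percentage of questions answered correctly" style of scoring, here is a minimal sketch for a multiple-choice benchmark; the answer-letter extraction, the `items` records, and the A-D option format are simplifying assumptions rather than any specific benchmark's official grader.

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone answer letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match.group(1) if match else None

def benchmark_accuracy(items: list[dict]) -> float:
    """Percentage of multiple-choice items answered correctly."""
    correct = sum(extract_choice(item["model_output"]) == item["answer"] for item in items)
    return 100.0 * correct / len(items)

# Hypothetical items; real benchmarks such as MMLU ship thousands of such records.
items = [
    {"model_output": "The answer is B.", "answer": "B"},
    {"model_output": "C",                "answer": "C"},
    {"model_output": "I think it's A",   "answer": "D"},
]
print(f"{benchmark_accuracy(items):.1f}% correct")  # -> 66.7% correct
```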
CMMLU
Accuracy
AI: ChatGPT-4 achieves almost perfect accuracy on elementary school level arithmetic problems. However, accuracy falls off dramatically when tackling upper level problems. While our model remains competitive with the state of the art, achieving high accuracy on advanced problems remains a significant challenge. • Self-reported
EvalPlus
Pass@1 — a metric for solving tasks on the first attempt. It shows what percentage of tasks the model can solve on the first try, without repeated attempts or iterative refinement. A high Pass@1 value is especially important in contexts where retries are costly or impossible and where an exact solution is required the first time, for example in real-time systems or production applications. Unlike metrics that allow multiple attempts, Pass@1 evaluates exclusively the model's ability to produce a correct solution immediately, which serves as a measure of its understanding of and reliability on the task. • Self-reported
MMLU-Pro
Accuracy
AI models can vary in how accurate they are—that is, whether they produce correct answers to questions. Measuring accuracy is one of the most common model evaluation methods because it corresponds to our intuitive notion of model capabilities. Examples of evaluations focused on accuracy include answering multiple-choice questions on standardized tests, responding to trivia questions (e.g., TriviaQA), and computing answers to math problems (e.g., MATH, GSM8K).
Metrics are highly task-dependent. For multiple-choice questions, a common choice is accuracy (i.e., the percentage of questions answered correctly). For other types of questions, metrics can include exact match, precision, recall, F1 score, and others, along with human assessments. • Self-reported
MultiPL-E
Pass@1 — this metric measures the probability that the model produces a correct answer on its first attempt, computed as the percentage of tasks the model solves correctly on the first try. In many computational tasks, especially programming, a model can be given several attempts to reach a correct solution; Pass@1 instead evaluates the model's ability to obtain a correct solution on the first attempt, without retries. To compute Pass@1, the model generates one solution for each task, and each solution is graded as correct or incorrect; the fraction of correct solutions is the Pass@1 score. A high Pass@1 indicates that the model can generate correct answers without needing several attempts, which reflects a deeper understanding of the task and higher reliability. • Self-reported
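Pass@1 is the k = 1 case of the more general pass@k family. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), assuming n samples are drawn per task and c of them pass; the sampling configuration actually used for the scores on this page is not stated here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples per task, 5 of which pass, the probability that at least one of
# k randomly chosen samples passes:
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=20, c=5, k=k):.3f}")
# pass@1 = 0.250, pass@5 = 0.806, pass@10 = 0.984
```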
TheoremQA
Accuracy — We evaluate models by their ability to correctly solve GPQA tasks; this evaluation is similar to standard LLM benchmark evaluation. We also assess models on other aspects of the questions and on the quality of their answers, and we compare errors from one model to another. Not all errors are alike: some types of errors occur more with one model than with another. Understanding these patterns can give a picture of the strengths and weaknesses of the models, as well as of how effectively improvements to a model address various types of errors. • Self-reported
License & Metadata
License
tongyi_qianwen
Announcement Date
July 23, 2024
Last Updated
July 19, 2025
Similar Models
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024
QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025
QwQ-32B-Preview
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Nov 2024
Price: $1.20/1M tokens
Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens
Qwen3 30B A3B
Alibaba
30.5B
Best score: 0.7 (GPQA)
Released: Apr 2025
Price: $0.10/1M tokens
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Qwen3.5 35B A3B
Alibaba
35.0B
Released: Mar 2026
Qwen3-Next-80B-A3B-Instruct
Alibaba
80.0B
Released: Sep 2025
Price: $0.15/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.