Qwen2.5 32B Instruct
Qwen2.5-32B-Instruct is the 32.5B-parameter instruction-tuned language model in the Qwen2.5 series. It is designed for instruction following, long-form text generation (over 8K tokens), understanding structured data such as tables, and producing structured output, especially JSON. It supports more than 29 languages.
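Since structured JSON output is a headline capability, a caller typically validates the completion before using it. A minimal sketch of that check; the `response` string below is a hypothetical model completion, not real model output:

```python
import json

def parse_model_json(response: str) -> dict:
    """Parse a model completion that is expected to be a JSON object.

    Raises ValueError if the completion is not valid JSON or not an object.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data

# Hypothetical completion illustrating the JSON-output capability.
response = '{"name": "Qwen2.5-32B-Instruct", "parameters_b": 32.5}'
record = parse_model_json(response)
print(record["parameters_b"])  # 32.5
```

In practice a schema validator would replace the bare `isinstance` check, but the failure modes (non-JSON text, wrong top-level type) are the same.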
Key Specifications
Parameters
32.5B
Context
-
Release Date
September 19, 2024
Average Score
74.3%
Timeline
Key dates in the model's history
Announcement
September 19, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
32.5B
Training Tokens
18.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
HellaSwag tests commonsense inference: given a short description of an everyday situation, the model must pick the most plausible continuation from four adversarially filtered options. • Self-reported
MMLU
MMLU (Massive Multitask Language Understanding) spans 57 subjects across STEM, the humanities, and the social sciences. Models are typically evaluated with the standard 5-shot prompting format, and accuracy is averaged across all subjects. • Self-reported
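The 5-shot MMLU protocol prepends answered exemplars to the question under test. A rough sketch of that prompt construction; the header string follows the commonly used evaluation-harness convention, and the exemplar content is illustrative, not from the official benchmark:

```python
def format_question(question, choices, answer=None):
    """Render one MMLU item; leave the answer blank for the question under test."""
    letters = "ABCD"
    lines = [question]
    lines.extend(f"{letters[i]}. {choice}" for i, choice in enumerate(choices))
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_mmlu_prompt(subject, shots, test_question, test_choices):
    """shots: list of (question, choices, answer_letter) exemplars (5 for 5-shot)."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    body = "\n\n".join(format_question(q, c, a) for q, c, a in shots)
    return header + body + "\n\n" + format_question(test_question, test_choices)

prompt = build_mmlu_prompt(
    "elementary mathematics",
    [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],  # one exemplar for brevity
    "What is 3 * 3?", ["6", "8", "9", "12"],
)
print(prompt)
```

The model's next token after the trailing "Answer:" is then compared against the gold letter.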
TruthfulQA
TruthfulQA measures whether a model avoids reproducing common misconceptions: its questions are crafted so that imitative answers are often false, and responses are scored for both truthfulness and informativeness. • Self-reported
Winogrande
Winogrande is a large-scale pronoun-resolution benchmark in the style of the Winograd Schema Challenge, built to reduce annotation artifacts and statistical biases; it tests commonsense reasoning through ambiguous pronoun references that must be resolved from context and world knowledge. • Self-reported
Programming
Programming skills tests
HumanEval
HumanEval contains 164 Python programming problems, each with a function signature, a docstring description, and unit tests. The model generates code, which is executed against the tests; because correctness is checked by actually running the code rather than comparing against a reference solution, results are reported as pass@k. • Self-reported
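pass@k is usually computed with the unbiased estimator from the original HumanEval (Codex) paper: given n samples per task of which c pass, it estimates the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing samples: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 5))   # 0.0  (no sample passes)
print(pass_at_k(2, 1, 1))    # 0.5  (one of two samples passes)
print(pass_at_k(10, 10, 1))  # 1.0  (all samples pass)
```

The final benchmark score averages this estimate over all 164 tasks.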
MBPP
MBPP (Mostly Basic Python Problems) contains crowd-sourced Python programming tasks, each with a short natural-language description and test cases used to check the generated solution. • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
GSM8K contains grade-school math word problems that require multi-step arithmetic reasoning; a problem is counted correct when the final numeric result matches the reference answer. • Self-reported
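GSM8K reference answers end with a final `#### <number>` line, and scoring typically compares the last number in a model's chain-of-thought completion against that reference. A small sketch of that extraction; the completion string is a hypothetical model output:

```python
import re

def extract_final_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

reference = "Natalia sold 48 / 2 = 24 clips in May. 48 + 24 = 72 clips. #### 72"
completion = "She sold 48 in April and 24 in May, so 48 + 24 = 72."  # hypothetical
assert extract_final_number(completion) == extract_final_number(reference)
```

Real harnesses add normalization (stripping units, currency symbols), but last-number extraction is the core of the comparison.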
MATH
The MATH benchmark contains 5,000 competition-level problems spanning algebra, number theory, geometry, counting, and probability. Solutions require multi-step reasoning, typically elicited with chain-of-thought prompting, and are scored on the correctness of the final answer. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
GPQA benchmark evaluation
GPQA is a set of graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry, written and validated by domain experts so that the answers are hard to find even with web search. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
The AI2 Reasoning Challenge (ARC) is a multiple-choice science question set drawn from grade-school exams, split into an Easy and a Challenge portion. ARC-C is the harder Challenge split: questions that simple retrieval and co-occurrence baselines answer incorrectly. Each question has four or five answer options, of which exactly one is correct, so the benchmark requires combining scientific knowledge with commonsense reasoning. • Self-reported
BBH
BBH (BIG-Bench Hard) is a suite of 23 challenging BIG-Bench tasks on which earlier language models fell short of average human-rater performance. It is typically evaluated with both standard prompting and chain-of-thought prompting ("Let's think step by step"), using the multiple-choice format where applicable. • Self-reported
HumanEval+
HumanEval+ extends HumanEval with many additional test cases per problem (via the EvalPlus framework), so solutions that pass the original, sparser tests but are subtly wrong are caught. Results are reported as pass@k, as with HumanEval. • Self-reported
MBPP+
MBPP+ augments the MBPP Python programming tasks with additional held-out test cases for stricter verification; a task counts as solved only if the generated program passes every test. Scoring uses standard greedy output without sampling multiple candidates or tool use. • Self-reported
MMLU-Pro
MMLU-Pro is a harder successor to MMLU: questions have ten answer options instead of four and lean more heavily on reasoning than on recall. • Self-reported
MMLU-Redux
MMLU-Redux is a manually re-annotated subset of MMLU in which erroneous questions have been corrected or removed. Each question offers four options, and the model must choose exactly one letter (A, B, C, or D); evaluation is zero-shot, with no in-context examples. • Self-reported
MMLU-STEM
MMLU-STEM is the subset of MMLU covering the science, technology, engineering, and mathematics subjects. • Self-reported
MultiPL-E
MultiPL-E tests code generation across 18 programming languages by translating the Python-only HumanEval problems and their tests into each language; given a description and starter code, the model completes the program, and the generated code is executed automatically to check correctness.
TheoremQA
TheoremQA contains 800 questions that require applying named theorems from mathematics, physics, electrical engineering, computer science, and finance, covering topics such as calculus, number theory, and probability. Answers are checked programmatically, though some responses may need manual verification. • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
September 19, 2024
Last Updated
July 19, 2025
Similar Models
Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens
Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024
Qwen3 30B A3B
Alibaba
30.5B
Best score: 0.7 (GPQA)
Released: Apr 2025
Price: $0.10/1M tokens
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Qwen3 32B
Alibaba
32.8B
Released: Apr 2025
Price: $0.40/1M tokens
QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025
Qwen3.5 27B
Alibaba
27.0B
Released: Mar 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.