Qwen2.5 32B Instruct
Qwen2.5-32B-Instruct is the 32.5B-parameter instruction-tuned language model in the Qwen2.5 series. It is designed for instruction following, long-form text generation (over 8K tokens), understanding structured data such as tables, and producing structured output, especially JSON. It supports more than 29 languages.
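Since structured JSON output is a headline capability, a caller typically validates the completion before using it. A minimal sketch of that check; the `response` string below is a hypothetical model completion, not real model output:

```python
import json

def parse_model_json(response: str) -> dict:
    """Parse a model completion that is expected to be a JSON object.

    Raises ValueError if the completion is not valid JSON or not an object.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data

# Hypothetical completion illustrating the JSON-output capability.
response = '{"name": "Qwen2.5-32B-Instruct", "parameters_b": 32.5}'
record = parse_model_json(response)
print(record["parameters_b"])  # 32.5
```

In practice a schema validator would replace the bare `isinstance` check, but the failure modes (non-JSON text, wrong top-level type) are the same.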
Key Specifications
Parameters
32.5B
Context
-
Release Date
September 19, 2024
Average Score
74.3%
Timeline
Key dates in the model's history
Announcement
September 19, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
32.5B
Training Tokens
18.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
HellaSwag
HellaSwag tests commonsense inference: given a short description of an everyday situation, the model must pick the most plausible continuation from four adversarially filtered options. • Self-reported
MMLU
MMLU (Massive Multitask Language Understanding) spans 57 subjects across STEM, the humanities, and the social sciences. Models are typically evaluated with the standard 5-shot prompting format, and accuracy is averaged across all subjects. • Self-reported
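The 5-shot MMLU protocol prepends answered exemplars to the question under test. A rough sketch of that prompt construction; the header string follows the commonly used evaluation-harness convention, and the exemplar content is illustrative, not from the official benchmark:

```python
def format_question(question, choices, answer=None):
    """Render one MMLU item; leave the answer blank for the question under test."""
    letters = "ABCD"
    lines = [question]
    lines.extend(f"{letters[i]}. {choice}" for i, choice in enumerate(choices))
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_mmlu_prompt(subject, shots, test_question, test_choices):
    """shots: list of (question, choices, answer_letter) exemplars (5 for 5-shot)."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    body = "\n\n".join(format_question(q, c, a) for q, c, a in shots)
    return header + body + "\n\n" + format_question(test_question, test_choices)

prompt = build_mmlu_prompt(
    "elementary mathematics",
    [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],  # one exemplar for brevity
    "What is 3 * 3?", ["6", "8", "9", "12"],
)
print(prompt)
```

The model's next token after the trailing "Answer:" is then compared against the gold letter.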
TruthfulQA
TruthfulQA measures whether a model avoids reproducing common misconceptions: its questions are crafted so that imitative answers are often false, and responses are scored for both truthfulness and informativeness. • Self-reported
Winogrande
Winogrande is a large-scale pronoun-resolution benchmark in the style of the Winograd Schema Challenge, built to reduce annotation artifacts and statistical biases; it tests commonsense reasoning through ambiguous pronoun references that must be resolved from context and world knowledge. • Self-reported
Programming
Programming skills tests
HumanEval
HumanEval contains 164 Python programming problems, each with a function signature, a docstring description, and unit tests. The model generates code, which is executed against the tests; because correctness is checked by actually running the code rather than comparing against a reference solution, results are reported as pass@k. • Self-reported
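pass@k is usually computed with the unbiased estimator from the original HumanEval (Codex) paper: given n samples per task of which c pass, it estimates the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failing samples: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 5))   # 0.0  (no sample passes)
print(pass_at_k(2, 1, 1))    # 0.5  (one of two samples passes)
print(pass_at_k(10, 10, 1))  # 1.0  (all samples pass)
```

The final benchmark score averages this estimate over all 164 tasks.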
MBPP
MBPP (Mostly Basic Python Problems) contains crowd-sourced Python programming tasks, each with a short natural-language description and test cases used to check the generated solution. • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
GSM8K contains grade-school math word problems that require multi-step arithmetic reasoning; a problem is counted correct when the final numeric result matches the reference answer. • Self-reported
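GSM8K reference answers end with a final `#### <number>` line, and scoring typically compares the last number in a model's chain-of-thought completion against that reference. A small sketch of that extraction; the completion string is a hypothetical model output:

```python
import re

def extract_final_number(text):
    """Return the last number in the text as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

reference = "Natalia sold 48 / 2 = 24 clips in May. 48 + 24 = 72 clips. #### 72"
completion = "She sold 48 in April and 24 in May, so 48 + 24 = 72."  # hypothetical
assert extract_final_number(completion) == extract_final_number(reference)
```

Real harnesses add normalization (stripping units, currency symbols), but last-number extraction is the core of the comparison.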
MATH
The MATH benchmark contains 5,000 competition-level problems spanning algebra, number theory, geometry, counting, and probability. Solutions require multi-step reasoning, typically elicited with chain-of-thought prompting, and are scored on the correctness of the final answer. • Self-reported
Reasoning
Logical reasoning and analysis
GPQA
GPQA benchmark evaluation
GPQA is a set of graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry, written and validated by domain experts so that the answers are hard to find even with web search. • Self-reported
Other Tests
Specialized benchmarks
ARC-C
The AI2 Reasoning Challenge (ARC) is a multiple-choice science question set drawn from grade-school exams, split into an Easy and a Challenge portion. ARC-C is the harder Challenge split: questions that simple retrieval and co-occurrence baselines answer incorrectly. Each question has four or five answer options, of which exactly one is correct, so the benchmark requires combining scientific knowledge with commonsense reasoning. • Self-reported
BBH
BBH (BIG-Bench Hard) is a suite of 23 challenging BIG-Bench tasks on which earlier language models fell short of average human-rater performance. It is typically evaluated with both standard prompting and chain-of-thought prompting ("Let's think step by step"), using the multiple-choice format where applicable. • Self-reported
HumanEval+
HumanEval+ extends HumanEval with many additional test cases per problem (via the EvalPlus framework), so solutions that pass the original, sparser tests but are subtly wrong are caught. Results are reported as pass@k, as with HumanEval. • Self-reported
MBPP+
MBPP+ augments the MBPP Python programming tasks with additional held-out test cases for stricter verification; a task counts as solved only if the generated program passes every test. Scoring uses standard greedy output without sampling multiple candidates or tool use. • Self-reported
MMLU-Pro
MMLU-Pro is a harder successor to MMLU: questions have ten answer options instead of four and lean more heavily on reasoning than on recall. • Self-reported
MMLU-Redux
MMLU-Redux is a manually re-annotated subset of MMLU in which erroneous questions have been corrected or removed. Each question offers four options, and the model must choose exactly one letter (A, B, C, or D); evaluation is zero-shot, with no in-context examples. • Self-reported
MMLU-STEM
MMLU-STEM is the subset of MMLU covering the science, technology, engineering, and mathematics subjects. • Self-reported
MultiPL-E
MultiPL-E tests code generation across 18 programming languages by translating the Python-only HumanEval problems and their tests into each language; given a description and starter code, the model completes the program, and the generated code is executed automatically to check correctness.
TheoremQA
TheoremQA contains 800 questions that require applying named theorems from mathematics, physics, electrical engineering, computer science, and finance, covering topics such as calculus, number theory, and probability. Answers are checked programmatically, though some responses may need manual verification. • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
September 19, 2024
Last Updated
July 19, 2025
Similar Models
Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens
Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024
Qwen3 30B A3B
Alibaba
30.5B
Best score: 0.7 (GPQA)
Released: Apr 2025
Price: $0.10/1M tokens
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Qwen3 32B
Alibaba
32.8B
Released: Apr 2025
Price: $0.40/1M tokens
QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025
Qwen3.5 27B
Alibaba
27.0B
Released: Mar 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.