Qwen3 32B

Alibaba

Qwen3-32B is a large language model from Alibaba's Qwen3 series. It contains 32.8 billion parameters, offers a 128K-token context window, supports 119 languages, and features hybrid thinking modes that allow switching between deep reasoning and quick responses. The model demonstrates strong performance in logical reasoning, instruction following, and agentic tasks.

Key Specifications

Parameters
32.8B
Context
128.0K
Release Date
April 29, 2025
Average Score
75.3%

Timeline

Key dates in the model's history
Announcement
April 29, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
32.8B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$0.80
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
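Given the per-token prices listed above, the cost of a single request can be estimated directly. A minimal sketch (the token counts in the example are hypothetical):

```python
# Cost estimate from the listed Qwen3-32B prices:
# $0.40 per 1M input tokens, $0.80 per 1M output tokens.
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 0.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 10,000-token prompt producing a 2,000-token answer.
print(round(request_cost(10_000, 2_000), 6))  # 0.0056
```

Note that output tokens cost twice as much as input tokens, so long generations (e.g. extended thinking-mode traces, up to the 128K output limit) dominate the bill.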

Benchmark Results

Model performance metrics across various tests and benchmarks

Other Tests

Specialized benchmarks
Aider
Pass@2 — an evaluation method that measures whether the model produces a correct answer within two attempts. Unlike Pass@1, which scores only the model's single top answer, Pass@2 counts a task as solved if the correct answer appears among the two top candidates. It is typically computed by sampling the model several times (possibly with different settings or methods), ranking the answers by probability or frequency, and checking whether a correct answer is among the top two. Pass@2 is especially useful for evaluating complex reasoning tasks where the model may hesitate between several candidate solutions: it gives a fuller picture of the model's abilities by capturing cases where the correct answer was "within reach" but not ranked first, and it rewards models that can use self-verification or tool-based checking to improve their output. Self-reported
50.2%
AIME 2024
Pass@64 — an evaluation metric that measures solution accuracy when the model is given up to 64 attempts per task. Each attempt is an independent sample, and the task counts as solved if any attempt yields the correct answer. The metric gauges a model's ability to find a solution through repeated trials, which is especially useful for complex tasks such as mathematical proofs or programming, where several iterations and different approaches are often required. Self-reported
81.4%
AIME 2025
Pass@64 — the fraction of tasks the model solves successfully within 64 attempts. It reflects the model's ability to reach a correct answer after repeated attempts, which may involve trying different approaches or recovering from earlier errors, and is most useful in settings where finding a correct solution at all matters more than getting it on the first try. Self-reported
72.9%
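The Pass@k metrics above (Pass@2 for Aider, Pass@64 for AIME) are usually computed with the standard unbiased combinatorial estimator: draw n ≥ k samples per task, count c correct ones, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of
    k samples drawn (without replacement) from n total samples, of
    which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws
        # must include at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per task, 30 of them correct.
print(round(pass_at_k(100, 30, 1), 6))   # 0.3 (equals c/n for k=1)
print(round(pass_at_k(100, 30, 64), 6))  # close to 1.0
```

Averaging this estimate over all tasks gives the reported Pass@k score; larger k makes the metric more forgiving, which is why Pass@64 scores run well above Pass@1 on the same problems.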
Arena Hard
Arena-Hard — a benchmark of challenging user prompts drawn from Chatbot Arena conversations, scored by an automatic LLM judge against a reference model's responses; its rankings correlate strongly with human preference evaluations. Self-reported
93.8%
BFCL
BFCL (Berkeley Function Calling Leaderboard) — a benchmark that evaluates a model's ability to invoke functions (tools) correctly, covering simple, parallel, and multiple function calls as well as detecting when no available function is relevant. Self-reported
70.3%
CodeForces
CodeForces — evaluates competitive programming ability on problems from the Codeforces platform; performance is typically reported as a rating or percentile relative to human contestants. Self-reported
95.2%
LiveBench
LiveBench — a contamination-resistant benchmark with frequently refreshed questions spanning math, coding, reasoning, language, and data analysis; answers are scored objectively against ground truth rather than by an LLM judge. Self-reported
74.9%
LiveCodeBench
LiveCodeBench — a continuously updated code-generation benchmark built from recently published competitive programming problems (LeetCode, AtCoder, Codeforces), designed to limit training-data contamination; solutions are verified by executing them against test cases. Self-reported
65.7%
MultiLF
Measures instruction-following accuracy across multiple languages. Self-reported
73.0%

License & Metadata

License
Apache 2.0
Announcement Date
April 29, 2025
Last Updated
July 19, 2025
