Qwen3 32B
Qwen3-32B is a large language model from Alibaba's Qwen3 series. The model has 32.8 billion parameters, a 128K-token context window, support for 119 languages, and hybrid thinking modes that let it switch between deep, step-by-step reasoning and quick responses. It demonstrates strong performance in logical reasoning, instruction following, and agentic tasks.
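The hybrid thinking modes are selected at prompt-construction time. A minimal sketch using Hugging Face Transformers and the enable_thinking flag of the Qwen3 chat template (the checkpoint name and generation settings here are assumptions, not taken from this page):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-32B"  # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# enable_thinking=True  -> deep-reasoning mode (model emits a thinking block first)
# enable_thinking=False -> quick-response mode
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))
```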
Key Specifications
Parameters
32.8B
Context
128.0K
Release Date
April 29, 2025
Average Score
75.3%
Timeline
Key dates in the model's history
Announcement
April 29, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
32.8B
Training Tokens
-
Knowledge Cutoff
-
Family
Qwen3
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$0.80
Max Input Tokens
128.0K
Max Output Tokens
128.0K
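At the listed rates, per-request cost is simple arithmetic: tokens divided by one million, times the per-million price. A minimal sketch with a made-up request size:

```python
INPUT_PER_M = 0.40   # USD per 1M input tokens
OUTPUT_PER_M = 0.80  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Hypothetical example: a 12,000-token prompt with a 2,000-token completion
print(f"${request_cost(12_000, 2_000):.4f}")  # $0.0064
```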
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
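When the open weights are self-hosted, these features are typically exposed through an OpenAI-compatible API by a serving engine such as vLLM. A minimal function-calling sketch, where the endpoint URL, API key, and the get_weather tool are placeholders, not part of this page:

```python
from openai import OpenAI

# Placeholder endpoint for a self-hosted OpenAI-compatible server (e.g., vLLM)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```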
Benchmark Results
Model performance metrics across various tests and benchmarks
Other Tests
Specialized benchmarks
Aider
Pass@2 — an evaluation metric that measures whether the model produces a correct answer within two attempts. Unlike Pass@1, which scores only the model's single top answer, Pass@2 counts a task as solved if the correct answer appears among the two top candidates. Computing it usually involves three steps: (1) sample the model several times, possibly with different settings or methods; (2) rank the answers by probability or by model preference; (3) check whether the correct answer is among the top two. Pass@2 is especially useful for evaluating complex reasoning tasks where the model may waver between several candidate solutions, since it gives a fuller picture of the model's abilities in cases where the correct answer was "within reach" of its thinking. Its value is that it can identify models that have the knowledge or reasoning ability but do not always rank the correct answer first, and it can also indicate how much a model would gain from output-improvement methods such as reranking or solution-verification tools (a sampling-based formulation of pass@k is sketched after the AIME entries below). • Self-reported
AIME 2024
Pass@64 — an evaluation metric that measures whether the model solves a task when given up to 64 attempts. Each attempt is an independent try at the task, and if any one of them produces the correct answer, the task counts as solved. The metric is useful for gauging a model's ability to eventually find a solution across many attempts, which matters for complex tasks where a single attempt is often not enough. Pass@64 reflects the model's capacity to try varied approaches: in domains such as mathematical proof or programming, several iterations are often needed to reach a correct solution, and Pass@64 captures how well a model can succeed across repeated attempts. • Self-reported
AIME 2025
Pass@64 — an evaluation metric measuring how many tasks the model solves successfully within 64 attempts. It reflects the model's ability to reach a correct answer after repeated attempts, which may involve different approaches or recovery from earlier errors. The metric is especially useful in settings where several attempts are allowed and where finding a correct solution at all matters more than finding it on the first try. • Self-reported
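The page does not state how these self-reported pass@k numbers were computed. A common sampling-based formulation is the unbiased pass@k estimator from the Codex paper (Chen et al., 2021); a minimal sketch, with made-up sample counts in the example:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn for a task
    c: number of those samples that are correct
    k: attempt budget being estimated (e.g., 2 or 64)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 128 samples per task, 40 of them correct
print(pass_at_k(128, 40, 2))   # estimated pass@2
print(pass_at_k(128, 40, 64))  # estimated pass@64
```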
Arena Hard
Score — Arena-Hard evaluates models on a set of challenging real-user queries. A strong LLM acts as judge, comparing each model response against a baseline model's response, and the score is the resulting win rate. • Self-reported
BFCL
Score — BFCL (Berkeley Function-Calling Leaderboard) measures how accurately a model produces function/tool calls, covering simple, parallel, and multi-turn calling scenarios; generated calls are scored against ground-truth function signatures and arguments. • Self-reported
CodeForces
Elo rating — measures competitive-programming ability: the model attempts CodeForces-style problems and receives an Elo-like rating based on which problems it solves, making its skill comparable to that of human contestants. • Self-reported
LiveBench
Accuracy — the share of tasks for which the model's final answer is correct. For each example, the model is given the question and its final answer is extracted and scored as correct or incorrect. When the model offers several candidate answers, the task is counted as correct if the right answer appears at least once. The metric reflects the model's ability to solve tasks but does not evaluate the full solution process: a model may reach the right answer through flawed reasoning steps or by chance. Nevertheless, accuracy provides useful information about model performance. • Self-reported
LiveCodeBench
Pass rate — LiveCodeBench is a continuously updated code-generation benchmark built from recently published competitive-programming problems, which limits training-data contamination; solutions are scored by whether they pass the problems' test cases. • Self-reported
MultiLF
Accuracy
• Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
April 29, 2025
Last Updated
July 19, 2025
Similar Models
Qwen3-Next-80B-A3B-Instruct
Alibaba
80.0B
Released: Sep 2025
Price: $0.15/1M tokens
Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens
Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024
Qwen3 30B A3B
Alibaba
30.5B
Best score: 0.7 (GPQA)
Released: Apr 2025
Price: $0.10/1M tokens
Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024
Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
QwQ-32B-Preview
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Nov 2024
Price: $1.20/1M tokens
Qwen3.5 27B
Alibaba
27.0B
Released: Mar 2026
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance.