Qwen3 32B

Alibaba

Qwen3-32B is a large language model from Alibaba's Qwen3 series. It contains 32.8 billion parameters, offers a 128K-token context window, supports 119 languages, and features hybrid thinking modes that allow switching between deep reasoning and quick responses. The model demonstrates strong performance in logical reasoning, instruction following, and agentic tasks.

Key Specifications

Parameters
32.8B
Context
128.0K
Release Date
April 29, 2025
Average Score
75.3%

Timeline

Key dates in the model's history
Announcement
April 29, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
32.8B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.40
Output (per 1M tokens)
$0.80
Max Input Tokens
128.0K
Max Output Tokens
128.0K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
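Given the per-token prices listed above, the cost of a single request can be estimated directly. A minimal sketch (the token counts in the example are hypothetical):

```python
# Cost estimate from the listed Qwen3-32B prices:
# $0.40 per 1M input tokens, $0.80 per 1M output tokens.
INPUT_PRICE_PER_M = 0.40
OUTPUT_PRICE_PER_M = 0.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 10,000-token prompt producing a 2,000-token answer.
print(round(request_cost(10_000, 2_000), 6))  # 0.0056
```

Note that output tokens cost twice as much as input tokens, so long generations (e.g. extended thinking-mode traces, up to the 128K output limit) dominate the bill.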

Benchmark Results

Model performance metrics across various tests and benchmarks

Other Tests

Specialized benchmarks
Aider
Pass@2 — an evaluation method that measures whether the model produces a correct answer within two attempts. Unlike Pass@1, which scores only the model's single top answer, Pass@2 counts a task as solved if the correct answer appears among the two top candidates. It is typically computed by sampling the model several times (possibly with different settings or methods), ranking the answers by probability or frequency, and checking whether a correct answer is among the top two. Pass@2 is especially useful for evaluating complex reasoning tasks where the model may hesitate between several candidate solutions: it gives a fuller picture of the model's abilities by capturing cases where the correct answer was "within reach" but not ranked first, and it rewards models that can use self-verification or tool-based checking to improve their output. Self-reported
50.2%
AIME 2024
Pass@64 — an evaluation metric that measures solution accuracy when the model is given up to 64 attempts per task. Each attempt is an independent sample, and the task counts as solved if any attempt yields the correct answer. The metric gauges a model's ability to find a solution through repeated trials, which is especially useful for complex tasks such as mathematical proofs or programming, where several iterations and different approaches are often required. Self-reported
81.4%
AIME 2025
Pass@64 — the fraction of tasks the model solves successfully within 64 attempts. It reflects the model's ability to reach a correct answer after repeated attempts, which may involve trying different approaches or recovering from earlier errors, and is most useful in settings where finding a correct solution at all matters more than getting it on the first try. Self-reported
72.9%
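The Pass@k metrics above (Pass@2 for Aider, Pass@64 for AIME) are usually computed with the standard unbiased combinatorial estimator: draw n ≥ k samples per task, count c correct ones, and estimate pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of
    k samples drawn (without replacement) from n total samples, of
    which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws
        # must include at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per task, 30 of them correct.
print(round(pass_at_k(100, 30, 1), 6))   # 0.3 (equals c/n for k=1)
print(round(pass_at_k(100, 30, 64), 6))  # close to 1.0
```

Averaging this estimate over all tasks gives the reported Pass@k score; larger k makes the metric more forgiving, which is why Pass@64 scores run well above Pass@1 on the same problems.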
Arena Hard
Arena-Hard — a benchmark of challenging user prompts drawn from Chatbot Arena conversations, scored by an automatic LLM judge against a reference model's responses; its rankings correlate strongly with human preference evaluations. Self-reported
93.8%
BFCL
BFCL (Berkeley Function Calling Leaderboard) — a benchmark that evaluates a model's ability to invoke functions (tools) correctly, covering simple, parallel, and multiple function calls as well as detecting when no available function is relevant. Self-reported
70.3%
CodeForces
CodeForces — evaluates competitive programming ability on problems from the Codeforces platform; performance is typically reported as a rating or percentile relative to human contestants. Self-reported
95.2%
LiveBench
LiveBench — a contamination-resistant benchmark with frequently refreshed questions spanning math, coding, reasoning, language, and data analysis; answers are scored objectively against ground truth rather than by an LLM judge. Self-reported
74.9%
LiveCodeBench
LiveCodeBench — a continuously updated code-generation benchmark built from recently published competitive programming problems (LeetCode, AtCoder, Codeforces), designed to limit training-data contamination; solutions are verified by executing them against test cases. Self-reported
65.7%
MultiLF
Measures instruction-following accuracy across multiple languages. Self-reported
73.0%

License & Metadata

License
Apache 2.0
Announcement Date
April 29, 2025
Last Updated
July 19, 2025
