Qwen3 30B A3B
Qwen3-30B-A3B is a compact Mixture-of-Experts (MoE) model in Alibaba's Qwen3 series, with 30.5 billion total parameters of which 3.3 billion are active per token. The model features hybrid thinking/non-thinking modes, supports 119 languages, and improves on earlier Qwen agent capabilities. It aims to surpass predecessors such as QwQ-32B while activating significantly fewer parameters.
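The hybrid modes are toggled at the chat-template level. A minimal sketch using Hugging Face transformers, assuming the public Qwen/Qwen3-30B-A3B checkpoint and the enable_thinking template flag described in the Qwen3 model card:

```python
# Minimal sketch: toggling Qwen3's thinking mode via the chat template.
# Assumes the Qwen/Qwen3-30B-A3B checkpoint and the `enable_thinking`
# flag documented for the Qwen3 tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 100?"}]

# enable_thinking=True lets the model emit <think>...</think> reasoning
# before answering; False requests a direct reply instead.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```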
Key Specifications
Parameters
30.5B
Context
128.0K
Release Date
April 29, 2025
Average Score
73.3%
Timeline
Key dates in the model's history
Announcement
April 29, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
30.5B
Training Tokens
36.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
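The parameter figures above drive very different costs: per-token decode compute tracks the 3.3 billion active parameters, while weight memory tracks the full 30.5 billion. A rough back-of-envelope sketch, using the common ~2 FLOPs-per-active-parameter rule of thumb and ignoring attention FLOPs and KV-cache memory:

```python
# Back-of-envelope sketch: why "3.3B active of 30.5B total" matters.
# Rule of thumb: decode compute per token ~ 2 * active params (FLOPs);
# weight memory ~ total params * bytes per weight. Ignores attention
# FLOPs, KV cache, and activations, so treat as rough estimates only.
total_params = 30.5e9
active_params = 3.3e9
bytes_per_weight = 2  # bf16/fp16

flops_per_token = 2 * active_params                       # ~6.6 GFLOPs/token
weight_memory_gb = total_params * bytes_per_weight / 1e9  # ~61 GB in bf16

print(f"decode compute ~ {flops_per_token / 1e9:.1f} GFLOPs/token")
print(f"weight memory  ~ {weight_memory_gb:.0f} GB (bf16)")
```

This is the MoE trade-off in miniature: the model computes like a ~3B dense model but must still hold all experts in memory like a ~30B one.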
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.44
Max Input Tokens
128.0K
Max Output Tokens
128.0K
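At the listed rates, per-request cost is simple arithmetic; a minimal sketch using the $0.10/$0.44 per-million-token prices above (actual provider pricing varies, so treat these as illustrative):

```python
# Illustrative cost estimate from the listed rates ($0.10 input,
# $0.44 output, per 1M tokens). Actual provider pricing varies.
INPUT_PER_M = 0.10
OUTPUT_PER_M = 0.44

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1e6

# e.g. a 4k-token prompt with a 1k-token reply:
print(f"${request_cost(4_000, 1_000):.6f}")  # ≈ $0.000840
```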
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
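Function calling and structured output are typically exercised through an OpenAI-compatible endpoint (e.g. a vLLM or provider deployment). A hedged sketch, where the base URL, API key, and the get_weather tool are placeholders rather than any specific provider's API:

```python
# Sketch of a function-calling request against an OpenAI-compatible
# endpoint serving Qwen3-30B-A3B. Base URL, API key, and the tool
# definition are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```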
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA
Accuracy
Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Accuracy
Self-reported
AIME 2025
Accuracy
Self-reported
Arena Hard
Accuracy
Self-reported
BFCL
Accuracy
Self-reported
LiveBench
Accuracy
Self-reported
LiveCodeBench
Accuracy
Self-reported
Multi-IF
Accuracy
Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
April 29, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024

QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025

Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024

QwQ-32B-Preview
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Nov 2024
Price: $1.20/1M tokens

Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens

Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024

Qwen3 32B
Alibaba
32.8B
Released: Apr 2025
Price: $0.40/1M tokens

Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
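Those similarity criteria could be approximated with a simple weighted score; a hypothetical sketch whose fields and weights are invented for illustration, not the catalog's actual ranking algorithm:

```python
# Hypothetical similarity heuristic over the criteria the catalog names:
# developer, multimodality, parameter size, benchmark performance.
# Fields and weights are invented for illustration only.
def similarity(a: dict, b: dict) -> float:
    score = 0.0
    score += 0.3 * (a["developer"] == b["developer"])
    score += 0.2 * (a["multimodal"] == b["multimodal"])
    # Parameter sizes compared on a ratio scale (30.5B vs 32.5B is "close").
    ratio = min(a["params_b"], b["params_b"]) / max(a["params_b"], b["params_b"])
    score += 0.3 * ratio
    score += 0.2 * (1 - abs(a["best_score"] - b["best_score"]))
    return score

qwen3_30b = {"developer": "Alibaba", "multimodal": False,
             "params_b": 30.5, "best_score": 0.73}
qwq_32b = {"developer": "Alibaba", "multimodal": False,
           "params_b": 32.5, "best_score": 0.70}
print(f"{similarity(qwen3_30b, qwq_32b):.2f}")  # ≈ 0.98
```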