Qwen3 30B A3B
Qwen3-30B-A3B is a compact Mixture-of-Experts (MoE) model in Alibaba's Qwen3 series, with 30.5 billion total parameters of which 3.3 billion are active per token. The model features hybrid thinking/non-thinking modes, supports 119 languages, and improves on earlier Qwen agent capabilities. It aims to surpass predecessors such as QwQ-32B while activating significantly fewer parameters.
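The hybrid modes are toggled at the chat-template level. A minimal sketch using Hugging Face transformers, assuming the public Qwen/Qwen3-30B-A3B checkpoint and the enable_thinking template flag described in the Qwen3 model card:

```python
# Minimal sketch: toggling Qwen3's thinking mode via the chat template.
# Assumes the Qwen/Qwen3-30B-A3B checkpoint and the `enable_thinking`
# flag documented for the Qwen3 tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 100?"}]

# enable_thinking=True lets the model emit <think>...</think> reasoning
# before answering; False requests a direct reply instead.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```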
Key Specifications
Parameters
30.5B
Context
128.0K
Release Date
April 29, 2025
Average Score
73.3%
Timeline
Key dates in the model's history
Announcement
April 29, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
30.5B
Training Tokens
36.0T tokens
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval
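The parameter figures above drive very different costs: per-token decode compute tracks the 3.3 billion active parameters, while weight memory tracks the full 30.5 billion. A rough back-of-envelope sketch, using the common ~2 FLOPs-per-active-parameter rule of thumb and ignoring attention FLOPs and KV-cache memory:

```python
# Back-of-envelope sketch: why "3.3B active of 30.5B total" matters.
# Rule of thumb: decode compute per token ~ 2 * active params (FLOPs);
# weight memory ~ total params * bytes per weight. Ignores attention
# FLOPs, KV cache, and activations, so treat as rough estimates only.
total_params = 30.5e9
active_params = 3.3e9
bytes_per_weight = 2  # bf16/fp16

flops_per_token = 2 * active_params                       # ~6.6 GFLOPs/token
weight_memory_gb = total_params * bytes_per_weight / 1e9  # ~61 GB in bf16

print(f"decode compute ~ {flops_per_token / 1e9:.1f} GFLOPs/token")
print(f"weight memory  ~ {weight_memory_gb:.0f} GB (bf16)")
```

This is the MoE trade-off in miniature: the model computes like a ~3B dense model but must still hold all experts in memory like a ~30B one.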
Pricing & Availability
Input (per 1M tokens)
$0.10
Output (per 1M tokens)
$0.44
Max Input Tokens
128.0K
Max Output Tokens
128.0K
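At the listed rates, per-request cost is simple arithmetic; a minimal sketch using the $0.10/$0.44 per-million-token prices above (actual provider pricing varies, so treat these as illustrative):

```python
# Illustrative cost estimate from the listed rates ($0.10 input,
# $0.44 output, per 1M tokens). Actual provider pricing varies.
INPUT_PER_M = 0.10
OUTPUT_PER_M = 0.44

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request at the listed rates."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1e6

# e.g. a 4k-token prompt with a 1k-token reply:
print(f"${request_cost(4_000, 1_000):.6f}")  # ≈ $0.000840
```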
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
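Function calling and structured output are typically exercised through an OpenAI-compatible endpoint (e.g. a vLLM or provider deployment). A hedged sketch, where the base URL, API key, and the get_weather tool are placeholders rather than any specific provider's API:

```python
# Sketch of a function-calling request against an OpenAI-compatible
# endpoint serving Qwen3-30B-A3B. Base URL, API key, and the tool
# definition are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```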
Benchmark Results
Model performance metrics across various tests and benchmarks
Reasoning
Logical reasoning and analysis
GPQA
Accuracy
Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Accuracy
Self-reported
AIME 2025
Accuracy
Self-reported
Arena Hard
Accuracy
Self-reported
BFCL
Accuracy
Self-reported
LiveBench
Accuracy
Self-reported
LiveCodeBench
Accuracy
Self-reported
Multi-IF
Accuracy
Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
April 29, 2025
Last Updated
July 19, 2025
Similar Models
Qwen2.5 32B Instruct
Alibaba
32.5B
Best score: 0.9 (HumanEval)
Released: Sep 2024

QwQ-32B
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Mar 2025

Qwen2 72B Instruct
Alibaba
72.0B
Best score: 0.9 (HumanEval)
Released: Jul 2024

QwQ-32B-Preview
Alibaba
32.5B
Best score: 0.7 (GPQA)
Released: Nov 2024
Price: $1.20/1M tokens

Qwen2.5 72B Instruct
Alibaba
72.7B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $1.20/1M tokens

Qwen2.5 14B Instruct
Alibaba
14.7B
Best score: 0.8 (HumanEval)
Released: Sep 2024

Qwen3 32B
Alibaba
32.8B
Released: Apr 2025
Price: $0.40/1M tokens

Qwen2.5-Coder 32B Instruct
Alibaba
32.0B
Best score: 0.9 (HumanEval)
Released: Sep 2024
Price: $0.09/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
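Those similarity criteria could be approximated with a simple weighted score; a hypothetical sketch whose fields and weights are invented for illustration, not the catalog's actual ranking algorithm:

```python
# Hypothetical similarity heuristic over the criteria the catalog names:
# developer, multimodality, parameter size, benchmark performance.
# Fields and weights are invented for illustration only.
def similarity(a: dict, b: dict) -> float:
    score = 0.0
    score += 0.3 * (a["developer"] == b["developer"])
    score += 0.2 * (a["multimodal"] == b["multimodal"])
    # Parameter sizes compared on a ratio scale (30.5B vs 32.5B is "close").
    ratio = min(a["params_b"], b["params_b"]) / max(a["params_b"], b["params_b"])
    score += 0.3 * ratio
    score += 0.2 * (1 - abs(a["best_score"] - b["best_score"]))
    return score

qwen3_30b = {"developer": "Alibaba", "multimodal": False,
             "params_b": 30.5, "best_score": 0.73}
qwq_32b = {"developer": "Alibaba", "multimodal": False,
           "params_b": 32.5, "best_score": 0.70}
print(f"{similarity(qwen3_30b, qwq_32b):.2f}")  # ≈ 0.98
```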