
Qwen 3.5 Is Alibaba's Bid to Win the Agentic AI Era

Alibaba's Qwen 3.5 family uses extreme MoE efficiency to beat models 7x its size. The flagship is now live on Arena — and claims to outperform GPT-5.2.

Vlad Makarov, reviewed and published
7 min read

Alibaba's latest model family doesn't just iterate on its predecessor — it fundamentally changes the math on what's possible with sparse inference. Qwen 3.5, launched in February and now live on LMSys Arena as Qwen3.5-Max-Preview, claims to outperform GPT-5.2, Claude Opus 4.5, and Gemini 3 on key benchmarks while activating a fraction of the parameters.

The flagship model has 397 billion total parameters but fires only 17 billion per token. Alibaba says it's 60% cheaper to run than the previous generation and 8x more efficient at processing large workloads. Those are bold claims — and the early benchmark numbers suggest they're not entirely hype.

The Model Lineup

Qwen 3.5 ships as a family of four open-weight models under Apache 2.0, plus hosted variants through Alibaba Cloud:

| Model | Total Params | Active Params | Architecture |
|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | Sparse MoE |
| Qwen3.5-122B-A10B | 122B | 10B | Sparse MoE |
| Qwen3.5-35B-A3B | 35B | 3B | Sparse MoE |
| Qwen3.5-27B | 27B | 27B | Dense |

The efficiency story is most dramatic with the 35B-A3B variant. Despite activating only 3 billion parameters per token, it beats Qwen3-235B-A22B — a model roughly seven times larger — on core benchmarks. That's not a marginal improvement. It's a generational leap in architecture efficiency, achieved through better training data, reinforcement learning, and a new attention mechanism.
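Much of the efficiency story is simple arithmetic. A quick sketch, using only the parameter counts from the table above, shows how small the active fraction is for each MoE variant:

```python
# Active-parameter ratios for the Qwen 3.5 MoE lineup
# (parameter counts taken from the article's table).
models = {
    "Qwen3.5-397B-A17B": (397e9, 17e9),
    "Qwen3.5-122B-A10B": (122e9, 10e9),
    "Qwen3.5-35B-A3B":   (35e9, 3e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")

# The 35B-A3B model activates ~8.6% of its weights per token, yet the
# article reports it beating Qwen3-235B-A22B, whose total parameter
# count is roughly 7x larger (235 / 35 ≈ 6.7).
print(f"Size ratio, Qwen3-235B vs Qwen3.5-35B: {235 / 35:.1f}x")
```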

Architecture: What Changed

The biggest technical departure from Qwen 3 is the adoption of Gated DeltaNet, a linear attention variant that replaces full quadratic attention in many layers. Traditional transformer attention scales quadratically with sequence length — double the context, quadruple the compute. DeltaNet scales closer to linearly, which is how Qwen 3.5 handles context windows up to 262,000 tokens natively, extendable to roughly one million through RoPE scaling.
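A toy cost model makes the scaling claim concrete. This is not the actual DeltaNet recurrence (which involves a gated delta-rule state update), just an illustration of why quadratic versus linear growth dominates the long-context math:

```python
def quadratic_attention_cost(seq_len: int) -> int:
    # Full attention: every token attends to every other token,
    # so pairwise score computation grows with the square of length.
    return seq_len * seq_len

def linear_attention_cost(seq_len: int) -> int:
    # Linear-attention variants like DeltaNet carry a fixed-size
    # recurrent state, so cost grows proportionally with length.
    return seq_len

base = 131_000  # half of the 262K native context cited above
for n in (base, 2 * base):
    q, l = quadratic_attention_cost(n), linear_attention_cost(n)
    print(f"seq_len={n:>7,}  quadratic={q:.2e}  linear={l:.2e}")

# Doubling the context quadruples quadratic cost but only doubles linear cost.
```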

The MoE design uses high sparsity — the 397B flagship activates less than 5% of its total parameters per token. An FP8 training and inference pipeline cuts activation memory by approximately 50%, and multi-step prediction improves long-horizon planning. The hosted versions (Qwen3.5-Plus and Qwen3.5-Flash) offer one-million-token context out of the box.
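To see how that sub-5% activation figure arises mechanically, here is a minimal top-k gating sketch of the kind MoE routers use. The expert count and k below are illustrative assumptions, not Qwen 3.5's actual configuration, which the article does not state:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical router: 128 experts, 8 active per token.
logits = [math.sin(i * 0.7) for i in range(128)]  # stand-in gate logits
gates = route_top_k(logits, k=8)
print(f"{len(gates)}/128 experts active = {len(gates) / 128:.1%}")
```

Only the selected experts' feed-forward weights are loaded into the token's compute path, which is how total parameter count and per-token cost decouple.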

The tokenizer expanded to 248,320 tokens, covering 201 languages and dialects — up from 119 in Qwen 3.

Benchmark Performance

The flagship model's numbers across categories:

| Benchmark | Score | Category |
|---|---|---|
| AIME'26 | 91.3 | Math reasoning |
| OmniDocBench v1.5 | 90.8 | Document understanding |
| MMLU | 88.5 | General knowledge |
| GPQA Diamond | 88.4 | Expert reasoning |
| Video-MME | 87.5 | Video reasoning |
| LiveCodeBench v6 | 83.6 | Coding |
| MMMU-Pro | 79.0 | Visual reasoning |
| BrowseComp | 78.6 | Agentic search |
| SWE-bench Verified | 76.4 | Real-world programming |
| IFBench | 76.5 | Instruction following |
| BFCL v4 | 72.9 | Tool use |

The 91.3 on AIME'26 and 76.4 on SWE-bench are particularly notable — these are the kinds of benchmarks where even small improvements require meaningful capability gains. Alibaba explicitly claims superiority over OpenAI's GPT-5.2, Anthropic's Claude Opus 4.5, and Google's Gemini 3, describing Qwen 3.5 as "built for the Agentic AI Era."

Native Multimodal, Native Agentic

Unlike earlier Qwen releases that shipped separate text and vision-language models, Qwen 3.5 uses a unified vision-language backbone with early fusion. That means the same model handles text, images, video, and documents without switching between specialized variants.

The agentic capabilities are built in rather than bolted on: tool calling, web browsing, code interpretation, and long-horizon planning are part of the base model. Alibaba highlights visual agentic capabilities — the model can take autonomous actions across mobile and desktop applications, reading screens and interacting with UI elements.

A scalable asynchronous RL framework supports speculative decoding, rollout replay, and multi-turn rollout locking, which means the model can plan and execute multi-step tasks more reliably than models trained purely on next-token prediction.

The Competitive Landscape

Qwen 3.5 arrives during an intense period for Chinese AI. ByteDance launched Doubao 2.0 the same weekend, also targeting the agent era. Alibaba's chatbot trails ByteDance's Doubao, which counts 155 million weekly active users (DeepSeek, by comparison, has 81.6 million). To close the gap, Alibaba ran a 3-billion-yuan ($433 million) marketing campaign that let users buy food and beverages through the Qwen chatbot, producing a 7x jump in active users.

The Arena preview, deployed on March 19 with full release expected in early April, is Alibaba's way of letting the community validate the benchmark claims independently. It follows the pattern set by NVIDIA's Nemotron Cascade and other recent releases that prioritize public evaluation over controlled announcements.

Who This Is For

For developers running local inference, the 35B-A3B model is the standout — GPT-5-class performance at a fraction of the compute cost, under an Apache 2.0 license. It's small enough to run on consumer hardware with quantization, yet capable enough for production agentic workloads.
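Whether "consumer hardware" is realistic comes down mostly to weight memory. A back-of-envelope estimate for a 35-billion-parameter model at common precisions (weights only; the KV cache and activations add overhead on top):

```python
# Rough weight-memory footprint of a 35B-parameter model at
# different quantization levels. Illustrative arithmetic only.
PARAMS = 35e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{label}: ~{gib:.0f} GiB of weights")
```

At 4-bit quantization the weights land around 16 GiB, within reach of a single high-end consumer GPU or a unified-memory laptop, which is consistent with the article's framing of the 35B-A3B model as the local-inference pick.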

For enterprises, the hosted Qwen3.5-Plus and Max variants offer million-token context windows and managed infrastructure through Alibaba Cloud. The open-weight models give teams the option to self-host with full control over data and fine-tuning.

The message from Alibaba is clear: the model arms race is no longer about who can build the biggest model. It's about who can deliver the most capability per unit of compute. With Qwen 3.5, they're making a strong case that sparse MoE at extreme ratios — 397 billion parameters, 17 billion active — is the architecture that gets you there.
