
Qwen 3.5 Is Alibaba's Bid to Win the Agentic AI Era

Alibaba's Qwen 3.5 family uses extreme MoE efficiency to beat models 7x its size. The flagship is now live on Arena — and claims to outperform GPT-5.2.

Vlad Makarov, reviewed and published
7 min read

Alibaba's latest model family doesn't just iterate on its predecessor — it fundamentally changes the math on what's possible with sparse inference. Qwen 3.5, launched in February and now live on LMSys Arena as Qwen3.5-Max-Preview, claims to outperform GPT-5.2, Claude Opus 4.5, and Gemini 3 on key benchmarks while activating a fraction of the parameters.

The flagship model has 397 billion total parameters but fires only 17 billion per token. Alibaba says it's 60% cheaper to run than the previous generation and 8x more efficient at processing large workloads. Those are bold claims — and the early benchmark numbers suggest they're not entirely hype.

The Model Lineup

Qwen 3.5 ships as a family of four open-weight models under Apache 2.0, plus hosted variants through Alibaba Cloud:

| Model | Total Params | Active Params | Architecture |
|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B | Sparse MoE |
| Qwen3.5-122B-A10B | 122B | 10B | Sparse MoE |
| Qwen3.5-35B-A3B | 35B | 3B | Sparse MoE |
| Qwen3.5-27B | 27B | 27B | Dense |

The efficiency story is most dramatic with the 35B-A3B variant. Despite activating only 3 billion parameters per token, it beats Qwen3-235B-A22B — a model roughly seven times larger — on core benchmarks. That's not a marginal improvement. It's a generational leap in architecture efficiency, achieved through better training data, reinforcement learning, and a new attention mechanism.
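Much of the efficiency story is simple arithmetic. A quick sketch, using only the parameter counts from the table above, shows how small the active fraction is for each MoE variant:

```python
# Active-parameter ratios for the Qwen 3.5 MoE lineup
# (parameter counts taken from the article's table).
models = {
    "Qwen3.5-397B-A17B": (397e9, 17e9),
    "Qwen3.5-122B-A10B": (122e9, 10e9),
    "Qwen3.5-35B-A3B":   (35e9, 3e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")

# The 35B-A3B model activates ~8.6% of its weights per token, yet the
# article reports it beating Qwen3-235B-A22B, whose total parameter
# count is roughly 7x larger (235 / 35 ≈ 6.7).
print(f"Size ratio, Qwen3-235B vs Qwen3.5-35B: {235 / 35:.1f}x")
```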

Architecture: What Changed

The biggest technical departure from Qwen 3 is the adoption of Gated DeltaNet, a linear attention variant that replaces full quadratic attention in many layers. Traditional transformer attention scales quadratically with sequence length — double the context, quadruple the compute. DeltaNet scales closer to linearly, which is how Qwen 3.5 handles context windows up to 262,000 tokens natively, extendable to roughly one million through RoPE scaling.
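A toy cost model makes the scaling claim concrete. This is not the actual DeltaNet recurrence (which involves a gated delta-rule state update), just an illustration of why quadratic versus linear growth dominates the long-context math:

```python
def quadratic_attention_cost(seq_len: int) -> int:
    # Full attention: every token attends to every other token,
    # so pairwise score computation grows with the square of length.
    return seq_len * seq_len

def linear_attention_cost(seq_len: int) -> int:
    # Linear-attention variants like DeltaNet carry a fixed-size
    # recurrent state, so cost grows proportionally with length.
    return seq_len

base = 131_000  # half of the 262K native context cited above
for n in (base, 2 * base):
    q, l = quadratic_attention_cost(n), linear_attention_cost(n)
    print(f"seq_len={n:>7,}  quadratic={q:.2e}  linear={l:.2e}")

# Doubling the context quadruples quadratic cost but only doubles linear cost.
```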

The MoE design uses high sparsity — the 397B flagship activates less than 5% of its total parameters per token. An FP8 training and inference pipeline cuts activation memory by approximately 50%, and multi-step prediction improves long-horizon planning. The hosted versions (Qwen3.5-Plus and Qwen3.5-Flash) offer one-million-token context out of the box.
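To see how that sub-5% activation figure arises mechanically, here is a minimal top-k gating sketch of the kind MoE routers use. The expert count and k below are illustrative assumptions, not Qwen 3.5's actual configuration, which the article does not state:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical router: 128 experts, 8 active per token.
logits = [math.sin(i * 0.7) for i in range(128)]  # stand-in gate logits
gates = route_top_k(logits, k=8)
print(f"{len(gates)}/128 experts active = {len(gates) / 128:.1%}")
```

Only the selected experts' feed-forward weights are loaded into the token's compute path, which is how total parameter count and per-token cost decouple.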

The tokenizer expanded to 248,320 tokens, covering 201 languages and dialects — up from 119 in Qwen 3.

Benchmark Performance

The flagship model's numbers across categories:

| Benchmark | Score | Category |
|---|---|---|
| AIME'26 | 91.3 | Math reasoning |
| OmniDocBench v1.5 | 90.8 | Document understanding |
| MMLU | 88.5 | General knowledge |
| GPQA Diamond | 88.4 | Expert reasoning |
| Video-MME | 87.5 | Video reasoning |
| LiveCodeBench v6 | 83.6 | Coding |
| MMMU-Pro | 79.0 | Visual reasoning |
| BrowseComp | 78.6 | Agentic search |
| SWE-bench Verified | 76.4 | Real-world programming |
| IFBench | 76.5 | Instruction following |
| BFCL v4 | 72.9 | Tool use |

The 91.3 on AIME'26 and 76.4 on SWE-bench are particularly notable — these are the kinds of benchmarks where even small improvements require meaningful capability gains. Alibaba explicitly claims superiority over OpenAI's GPT-5.2, Anthropic's Claude Opus 4.5, and Google's Gemini 3, describing Qwen 3.5 as "built for the Agentic AI Era."

Native Multimodal, Native Agentic

Unlike earlier Qwen releases that shipped separate text and vision-language models, Qwen 3.5 uses a unified vision-language backbone with early fusion. That means the same model handles text, images, video, and documents without switching between specialized variants.

The agentic capabilities are built in rather than bolted on: tool calling, web browsing, code interpretation, and long-horizon planning are part of the base model. Alibaba highlights visual agentic capabilities — the model can take autonomous actions across mobile and desktop applications, reading screens and interacting with UI elements.

A scalable asynchronous RL framework supports speculative decoding, rollout replay, and multi-turn rollout locking, which means the model can plan and execute multi-step tasks more reliably than models trained purely on next-token prediction.

The Competitive Landscape

Qwen 3.5 arrives during an intense period for Chinese AI. ByteDance launched Doubao 2.0 the same weekend, also targeting the agent era. Alibaba's chatbot trails ByteDance's Doubao, which counts 155 million weekly active users (DeepSeek, by comparison, has 81.6 million). To close the gap, Alibaba ran a 3-billion-yuan ($433 million) marketing campaign that let users buy food and beverages through the Qwen chatbot, producing a 7x jump in active users.

The Arena preview, deployed on March 19 with full release expected in early April, is Alibaba's way of letting the community validate the benchmark claims independently. It follows the pattern set by NVIDIA's Nemotron Cascade and other recent releases that prioritize public evaluation over controlled announcements.

Who This Is For

For developers running local inference, the 35B-A3B model is the standout — GPT-5-class performance at a fraction of the compute cost, under an Apache 2.0 license. It's small enough to run on consumer hardware with quantization, yet capable enough for production agentic workloads.
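Whether "consumer hardware" is realistic comes down mostly to weight memory. A back-of-envelope estimate for a 35-billion-parameter model at common precisions (weights only; the KV cache and activations add overhead on top):

```python
# Rough weight-memory footprint of a 35B-parameter model at
# different quantization levels. Illustrative arithmetic only.
PARAMS = 35e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{label}: ~{gib:.0f} GiB of weights")
```

At 4-bit quantization the weights land around 16 GiB, within reach of a single high-end consumer GPU or a unified-memory laptop, which is consistent with the article's framing of the 35B-A3B model as the local-inference pick.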

For enterprises, the hosted Qwen3.5-Plus and Max variants offer million-token context windows and managed infrastructure through Alibaba Cloud. The open-weight models give teams the option to self-host with full control over data and fine-tuning.

The message from Alibaba is clear: the model arms race is no longer about who can build the biggest model. It's about who can deliver the most capability per unit of compute. With Qwen 3.5, they're making a strong case that sparse MoE at extreme ratios — 397 billion parameters, 17 billion active — is the architecture that gets you there.
