Tags: turboquant · quantization · local-llm · apple · macbook · qwen

TurboQuant Now Runs Qwen 3.5-9B on a MacBook Air

A developer used Google's TurboQuant compression in llama.cpp to run Qwen 3.5-9B with 20K context on a base MacBook Air M4 with just 16GB of RAM.

Vlad Makarov · reviewed and published
3 min read
A $1,099 laptop running a 9-billion-parameter model with 20K token context. That's the demo a developer posted on r/LocalLLaMA this week — and the 1,032 upvotes and 177 comments suggest nobody expected it to actually work this well.

What Happened

The setup is almost comically modest: a base-model MacBook Air with an M4 chip and 16GB of unified memory. The developer compiled llama.cpp with Google's TurboQuant compression baked in and loaded Qwen 3.5-9B — a model that would normally choke on hardware this constrained, especially at longer context lengths.
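The post doesn't include the exact commands, and TurboQuant's llama.cpp integration is too new for its flag names to be settled, but mainline llama.cpp's existing KV-cache quantization options give a feel for what the invocation looks like. The model filename and cache types below are illustrative placeholders, not the developer's actual settings:

```shell
# Build llama.cpp (Metal acceleration is enabled by default on macOS)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Run with a quantized KV cache: --cache-type-k/--cache-type-v set the
# cache precision, and flash attention (-fa) is required when the V
# cache is quantized. A 20K context fits in -c 20480.
./build/bin/llama-cli -m qwen3.5-9b-q4_k_m.gguf \
    -c 20480 -fa \
    --cache-type-k q4_0 \
    --cache-type-v q4_0
```

A TurboQuant-enabled build would presumably swap the cache types for its 3-bit format, but the shape of the command stays the same.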

TurboQuant's trick is compressing the KV cache from 32 bits down to 3 bits per value, roughly a 10x reduction in the memory that grows with every token the model processes. That KV cache is the bottleneck that decides whether a model fits in your RAM or crashes trying. At 32-bit precision, running Qwen 3.5-9B with a 20K context window would blow past 16GB easily. At 3 bits, it fits with room to spare.
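The back-of-envelope math is easy to check yourself. The layer count, KV-head count, and head dimension below are assumptions for illustration, not Qwen 3.5-9B's published architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Total KV-cache size: K and V each store one value per layer,
    per KV head, per head dimension, per cached token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8

# Hypothetical dimensions for a ~9B model with grouped-query attention
ctx = 20_000
fp32 = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128,
                      context_len=ctx, bits_per_value=32)
q3 = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128,
                    context_len=ctx, bits_per_value=3)

print(f"32-bit KV cache: {fp32 / 2**30:.1f} GiB")  # ~6.1 GiB
print(f" 3-bit KV cache: {q3 / 2**30:.2f} GiB")    # ~0.57 GiB
print(f"reduction: {fp32 / q3:.1f}x")              # 10.7x
```

Whatever the exact dimensions, the ratio is fixed by the bit widths: 32/3 ≈ 10.7x, which is the difference between a KV cache that eats a third of a 16GB machine and one that barely registers.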

Google published the TurboQuant paper less than a week ago. The community had it integrated into llama.cpp within 24 hours. A Medium article titled "TurboQuant: Local Agent Swarms with 4M-Token Context on $5K Desktop" started circulating the same day, and StarkInsider called it "the unsexy AI breakthrough worth watching." The r/LocalLLaMA explanation post hit 1,186 upvotes — one of the highest-scoring posts this week. Even the inevitable "Me waiting for TurboQuant" meme pulled 568 upvotes.

Why This Matters

Running large models locally has been a rich-person hobby. You needed a Mac Studio with 192GB of memory, or a desktop with multiple high-end GPUs, or you rented cloud compute. The practical floor for anything beyond toy-sized models was $3,000 to $5,000 in hardware.

TurboQuant rewrites that math. Models that previously demanded 64GB+ of RAM now run on 16GB machines. That's not an incremental improvement — it opens local LLM inference to anyone with a current-generation laptop. Students, indie developers, people in countries where cloud API costs add up fast.

The RotorQuant project that appeared three days ago pushes even further, running 10-19x faster than TurboQuant's own compression step. And with Apple reportedly killing the 512GB Mac Studio, the timing is almost poetic: just as Apple removes the high-memory option, the community makes it unnecessary.

What's Next

The immediate question is whether Apple's MLX framework will get native TurboQuant support, bypassing the llama.cpp route entirely. Given how fast the community moved on the initial integration, it's hard to imagine this taking more than a few weeks. For now, the llama.cpp path works — and a $1,099 MacBook Air just became a surprisingly capable local AI machine.
