Google's TurboQuant Compresses AI Models to 2.5 Bits Without Breaking Them
A new quantization method from Google Research achieves 4.9x KV cache compression with zero accuracy loss. No training required — it works on any model instantly.

Most quantization methods start falling apart below 4 bits. Google Research just published results at 2.5 bits with zero accuracy loss. The technique is called TurboQuant, and the part that matters most: it requires no training, no calibration data, and works instantly on any model.
How It Actually Works
TurboQuant is a two-stage compression framework targeting the KV cache — the memory bottleneck that limits how many tokens a model can process at once. The approach is built on two papers presented at ICLR 2026 and AISTATS 2026.
The first stage, called PolarQuant, applies a random rotation to the input vectors. In high dimensions, this rotation forces each coordinate into a concentrated, predictable distribution — a Beta distribution that's nearly identical regardless of the original data. Because the distribution is known in advance, PolarQuant can map data onto a fixed circular grid without storing the normalization constants that traditional quantization methods require. Those constants typically add 1-2 extra bits of overhead per number, partially negating the compression gains. PolarQuant eliminates them entirely.
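The rotation-then-fixed-grid idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the rotation here is a dense QR-derived orthogonal matrix (real systems would use fast structured rotations), and the 3-bit grid is a plain uniform grid rather than PolarQuant's circular one. One norm per vector is kept for the demo; the point is that no per-block scale constants are needed because the post-rotation coordinate distribution is known in advance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random orthogonal rotation via QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# An input vector with a skewed, "hard" distribution.
x = rng.exponential(scale=3.0, size=d)

# After rotation, the normalized coordinates concentrate into a
# predictable, data-independent distribution.
y = Q @ x
z = y / np.linalg.norm(y)

# Fixed 3-bit grid in standard-deviation units: the SAME grid works
# for every vector, so no per-block scale factors are stored.
levels = np.linspace(-4, 4, 2**3)
q = levels[np.argmin(np.abs(np.sqrt(d) * z[:, None] - levels), axis=1)]

# Dequantize with the same fixed grid.
z_hat = q / np.sqrt(d)
err = np.linalg.norm(z - z_hat) / np.linalg.norm(z)
print(f"relative quantization error: {err:.3f}")
```

With 8 levels spanning roughly four standard deviations, the per-coordinate error is a fraction of the signal, and no calibration data was needed to choose the grid.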
The second stage uses QJL — a Quantized Johnson-Lindenstrauss transform — to correct residual errors from the first stage. Each remaining value gets reduced to a single sign bit. The math guarantees that inner product estimates remain unbiased, which is critical because transformer attention scores are inner products.
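A minimal sketch of the sign-bit estimator, assuming the standard Gaussian-sketch construction: for a Gaussian row s, E[<s,q> * sign(<s,k>)] = sqrt(2/pi) * <q,k> / ||k||, so rescaling by sqrt(pi/2)/m * ||k|| yields an unbiased inner-product estimate from one bit per measurement. Dimensions and seeds here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 4096  # vector dimension; number of 1-bit measurements

q = rng.standard_normal(d)   # query, kept in full precision
k = rng.standard_normal(d)   # key, compressed to sign bits

S = rng.standard_normal((m, d))   # Gaussian JL projection
k_bits = np.sign(S @ k)           # one bit per measurement
k_norm = np.linalg.norm(k)        # one scalar stored per key

# Debiased estimate of <q, k> from the sign bits alone.
est = np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, k_bits)

print(f"true  <q,k> = {np.dot(q, k): .3f}")
print(f"QJL estimate = {est: .3f}")
```

Unbiasedness is the key property: attention scores are inner products, so errors average out across the sequence instead of systematically skewing the softmax.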
Together, the two stages hit a tunable overall bit-width (2.5 bits in the headline result) with provably unbiased estimation, the theoretical property that makes this more than just another approximation hack.
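The two stages can be combined in a toy end-to-end example. Here a coarse uniform quantizer stands in for PolarQuant and a sign-bit sketch of its residual stands in for QJL; all names, grids, and sizes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 2048

x = rng.standard_normal(d)   # value vector to compress
g = rng.standard_normal(d)   # query it will be dotted against

# Stage 1: coarse 2-bit quantization (stand-in for PolarQuant's grid).
levels = np.array([-1.5, -0.5, 0.5, 1.5])
x1 = levels[np.argmin(np.abs(x[:, None] - levels), axis=1)]
r = x - x1                   # residual error left by stage 1

# Stage 2: 1-bit sketch of the residual (stand-in for QJL),
# debiased as before so the correction is unbiased.
S = rng.standard_normal((m, d))
r_bits, r_norm = np.sign(S @ r), np.linalg.norm(r)
corr = np.sqrt(np.pi / 2) / m * r_norm * np.dot(S @ g, r_bits)

exact = np.dot(g, x)
stage1_only = np.dot(g, x1)
two_stage = stage1_only + corr
print(f"exact={exact:.3f}  stage1={stage1_only:.3f}  two-stage={two_stage:.3f}")
```

The stage-2 correction targets exactly the error stage 1 leaves behind, which is why the combined estimator stays unbiased at an aggressively low bit budget.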
The Numbers
Community testing on Qwen 3.5 35B confirmed the headline claims:
- 2.5-bit quantization: 4.9x smaller KV cache, 100% exact match across all context lengths
- 3.5-bit quantization: 3.8x smaller KV cache, 100% exact match
- 4-bit quantization: 8x speedup in attention logit computation on NVIDIA H100 GPUs
On the Needle-in-a-Haystack benchmark — the standard test for whether compression breaks long-context retrieval — TurboQuant achieved perfect retrieval accuracy up to 104K tokens under 4x compression. No degradation.
For context, the indexing speed comparison against Product Quantization is almost absurd. Where PQ needs 494 seconds to index 3072-dimensional vectors, TurboQuant does it in 0.002 seconds — roughly 250,000 times faster.
What This Changes
The practical impact is immediate. KV cache memory is the primary constraint on how many tokens a model can process and how many concurrent users a server can handle. Compressing it by 4-5x without accuracy loss means longer context windows on the same hardware, more concurrent inference sessions, and substantially lower cloud compute costs.
Within 24 hours of publication, community ports appeared for MLX (Apple Silicon) and llama.cpp. For local AI users running models on consumer hardware — the same community currently celebrating GPU prices dropping — TurboQuant effectively quadruples the context that fits in their available VRAM.
The approach is also training-free, which means it works on any existing fine-tuned model. You don't need to retrain or recalibrate. Run TurboQuant on your Llama, Qwen, or Gemma checkpoint and the compressed version is ready immediately.
Google hasn't announced whether TurboQuant will be integrated into Gemini's inference pipeline, but given that KV cache compression directly addresses Gemini's long-context bottleneck, it would be surprising if it weren't already in internal testing. For the rest of the ecosystem, the papers and code are public — and the 7MB browser model we covered earlier this week could benefit enormously from this kind of compression.

