Google Drops Gemma 4 Under Apache 2.0 — And It Runs on a Raspberry Pi
Google DeepMind releases Gemma 4, a family of four open models with agentic skills, 256K context, and multimodal capabilities — all under the Apache 2.0 license.

Remember when we asked where Gemma 4 was? The wait is over. On April 2, Google DeepMind released an entire family of four models that bring frontier-level reasoning, native multimodality, and agentic capabilities to everything from cloud servers down to a Raspberry Pi 5. And the most significant change might not be technical at all — it's the license.
Apache 2.0 Changes the Game
For two years, Gemma models shipped under a custom Google license that made enterprise legal teams nervous. Usage restrictions, terms Google could update unilaterally, compliance edge cases that required lawyers before engineers could start building. Many teams chose Qwen or Mistral instead, not because Gemma was worse, but because the licensing was cleaner elsewhere.
Gemma 4 ships under a standard Apache 2.0 license — the same permissive terms used by practically every other serious open-weight model. No custom clauses, no "Harmful Use" carve-outs requiring legal interpretation, no restrictions on redistribution or commercial deployment. For enterprises that evaluate open models through procurement, this removes a real bottleneck.
The timing is interesting. While some Chinese AI labs — notably Alibaba with its latest Qwen 3.6 releases — have been pulling back from fully open releases, Google is moving in the opposite direction, opening its most capable Gemma ever while explicitly stating the architecture draws from commercial Gemini 3 research.
Four Models, Two Tiers
The family splits into workstation and edge tiers:
| Model | Total Params | Active / Effective | Context | Modalities | Target |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B (dense) | 31B | 256K | Text, Vision | Workstation / Cloud |
| Gemma 4 26B A4B | 25.2B (MoE) | 3.8B | 256K | Text, Vision | Workstation / Consumer GPU |
| Gemma 4 E4B | ~5B | 4B effective | 128K | Text, Vision, Audio | Edge / Mobile |
| Gemma 4 E2B | 5.1B | 2.3B effective | 128K | Text, Vision, Audio | IoT / Raspberry Pi |
The naming takes a moment to parse. The "E" prefix means "effective parameters" — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique called Per-Layer Embeddings. These tables are large on disk but cheap to compute, so the model runs like a 2B while technically weighing more.
The "A" in 26B A4B stands for "active parameters." Only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference. That means 26B-class intelligence at roughly 4B compute costs — a significant advantage for anyone paying per-token.
The MoE Architecture: 128 Small Experts
Where most large MoE models use a handful of big experts, Google went with 128 small ones, activating eight per token plus one shared always-on expert. The practical result is a model that benchmarks alongside dense 27B–31B models while running at 4B-class throughput during inference. Fewer GPUs, lower latency, cheaper per-token costs.
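The routing pattern described above can be sketched in a few lines. This is an illustrative toy, not Gemma 4's actual code: the hidden dimension, router, and expert weights here are made-up assumptions; only the shape of the computation — 128 small experts, top-8 per token, plus one always-on shared expert — follows the article.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 8, 64  # 128 small experts, 8 active per token

# Router projects a token to one logit per expert; last expert slot is the
# shared always-on expert, which bypasses routing entirely.
router_w = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)
experts = rng.standard_normal((NUM_EXPERTS + 1, D, D)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts plus the shared expert."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                          # 8 best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalised softmax
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out + x @ experts[-1]                               # shared expert

x = rng.standard_normal(D)
y = moe_layer(x)
print(y.shape)  # (64,)
```

Only 9 of the 129 expert matrices multiply against any given token, which is where the "4B-class throughput from a 25B model" economics come from.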
Both workstation models use a hybrid attention mechanism — interleaving local sliding window attention with full global attention — which enables the 256K context window without blowing up memory. The final layer is always global, ensuring coherent long-range reasoning.
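The interleaving itself is just a layer schedule. Google hasn't published Gemma 4's exact ratio, so the 5-to-1 local-to-global pattern below is an assumption for illustration; the one detail taken from the article is that the final layer is forced to be global.

```python
def attention_schedule(num_layers: int, local_per_global: int = 5) -> list:
    """Interleave sliding-window ('local') layers with full ('global') ones.

    The 5:1 ratio is an assumed default, not Gemma 4's published config.
    """
    kinds = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    kinds[-1] = "global"  # final layer is always global for long-range coherence
    return kinds

sched = attention_schedule(12)
print(sched)  # mostly 'local', with periodic 'global' layers; last is 'global'
```

Because the local layers only attend within a fixed window, their KV cache stays constant-size as context grows; only the sparse global layers pay the full 256K cost.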
Benchmarks That Would Have Been Frontier-Class Last Year
The numbers show a generational leap:
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 3 27B | Note |
|---|---|---|---|---|
| AIME 2026 | 89.2% | 88.3% | 20.8%* | Math reasoning |
| LiveCodeBench v6 | 80.0% | 77.1% | 29.1%* | Coding |
| Codeforces ELO | 2,150 | — | — | Competitive programming |
| GPQA Diamond | — | 82.3% | — | Graduate-level science |
| MMMU Pro | 76.9% | — | — | Visual understanding |
| MATH-Vision | 85.6% | — | — | Visual math |
*Gemma 3 scores without thinking mode.
The gap between MoE and dense variants is modest given the massive inference cost difference. And the edge models punch well above their weight: E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — numbers that would have been impressive for a full-size model not long ago.
Multimodal From the Ground Up
Previous open models typically treated vision and audio as bolt-ons. Gemma 4 integrates them at the architecture level. All four models handle variable aspect-ratio images with configurable visual token budgets (70 to 1,120 tokens per image), letting developers trade detail against compute depending on the task.
The two edge models add native audio processing — automatic speech recognition and speech translation (spoken input in, translated text out), all on-device. The audio encoder has been compressed to 305 million parameters (down from 681M in Gemma 3n), while the frame duration dropped from 160ms to 40ms for snappier transcription.
Function calling is also native across all four models, drawing on Google's FunctionGemma research. Unlike approaches that rely on instruction-following to coax structured tool use, Gemma 4's function calling was trained into the model from the start, optimized for multi-turn agentic flows with multiple tools.
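To make "structured tool use" concrete, here is a hypothetical sketch of the flow: a JSON tool schema, a machine-parseable call of the kind a natively tool-trained model emits, and dispatch to a local function. The schema format, field names, and `get_weather` function are all invented for illustration — this is not Gemma 4's actual wire format.

```python
import json

# Hypothetical tool schema, in the JSON-Schema style most tool-calling APIs use.
weather_tool = {
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A natively tool-trained model emits a structured call rather than prose
# that must be coaxed into JSON. Simulated model output:
model_output = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'
call = json.loads(model_output)

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real weather API

assert call["tool"] == weather_tool["name"]
result = get_weather(**call["arguments"])
print(result)  # Sunny in Zurich
```

In a multi-turn agentic loop, `result` would be fed back to the model as a tool response message, and the model would decide whether to call another tool or answer the user.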
It Actually Runs on a Raspberry Pi
This isn't marketing fluff. Google's LiteRT-LM framework reaches a prefill throughput of 133 tokens per second and a decode throughput of 7.6 tokens per second with the E2B on a Raspberry Pi 5. Thanks to 2-bit and 4-bit weight support, the E2B fits in under 1.5 GB of memory on some devices.
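The memory figure checks out on the back of an envelope: 5.1 billion weights at 2 bits each. Real runtimes add packing and KV-cache overhead, so treat this as an order-of-magnitude sanity check rather than a precise footprint.

```python
def weight_gib(params: float, bits: int) -> float:
    """Raw weight storage in GiB for a given parameter count and bit width."""
    return params * bits / 8 / 2**30

gb = weight_gib(5.1e9, 2)  # E2B total params at 2-bit precision
print(f"{gb:.2f} GiB")     # ~1.19 GiB, consistent with the <1.5 GB claim
```

At 4-bit the same model would need roughly 2.4 GiB for weights alone, which is why the tightest devices lean on the 2-bit path.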
The broader platform story matters for developers: Gemma 4 works across Android (via AICore), iOS, Windows, Linux, macOS (Metal), WebGPU in browsers, and Qualcomm IQ8 NPU platforms. Google also launched a new Python CLI tool for experimenting with Gemma on any machine — including tool calling support — without writing code.
For on-device AI, the Agent Skills feature in Google AI Edge Gallery demonstrates what Gemma 4 can do autonomously: querying knowledge bases, generating interactive visualizations, integrating with other models for music or image generation, and building multi-step workflows entirely through conversation.
What This Means
The real competitive angle isn't any single benchmark. It's the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in one model family with deployment from IoT to cloud. That combination didn't exist in open-weight land before this week.
For enterprises evaluating open models, the Apache 2.0 license means the evaluation can start without a call to legal. For startups building agentic products, the MoE variant offers frontier-adjacent quality at dramatically lower serving costs. And for the local LLM community that just celebrated llama.cpp hitting 100K stars, Gemma 4 adds another powerful option that should appear in Ollama and LM Studio within days.
Google has hinted that this may not be the complete Gemma 4 family, with additional sizes likely to follow. But what's available today — combined with the TurboQuant compression that Google also recently released — paints a picture of a company that's serious about making its best research accessible to everyone, not just paying API customers.


