Google Drops Gemma 4 Under Apache 2.0 — And It Runs on a Raspberry Pi
Google DeepMind releases Gemma 4, a family of four open models with agentic skills, 256K context, and multimodal capabilities — all under the Apache 2.0 license.

Remember when we asked where Gemma 4 was? The wait is over. On April 2, Google DeepMind released an entire family of four models that bring frontier-level reasoning, native multimodality, and agentic capabilities to everything from cloud servers down to a Raspberry Pi 5. And the most significant change might not be technical at all — it's the license.
Apache 2.0 Changes the Game
For two years, Gemma models shipped under a custom Google license that made enterprise legal teams nervous. Usage restrictions, terms Google could update unilaterally, compliance edge cases that required lawyers before engineers could start building. Many teams chose Qwen or Mistral instead, not because Gemma was worse, but because the licensing was cleaner elsewhere.
Gemma 4 ships under a standard Apache 2.0 license — the same permissive terms used by practically every other serious open-weight model. No custom clauses, no "Harmful Use" carve-outs requiring legal interpretation, no restrictions on redistribution or commercial deployment. For enterprises that evaluate open models through procurement, this removes a real bottleneck.
The timing is interesting. While some Chinese AI labs — notably Alibaba with its latest Qwen 3.6 releases — have been pulling back from fully open releases, Google is moving in the opposite direction, opening its most capable Gemma ever while explicitly stating the architecture draws from commercial Gemini 3 research.
Four Models, Two Tiers
The family splits into workstation and edge tiers:
| Model | Total Params | Active / Effective | Context | Modalities | Target |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B (dense) | 31B | 256K | Text, Vision | Workstation / Cloud |
| Gemma 4 26B A4B | 25.2B (MoE) | 3.8B | 256K | Text, Vision | Workstation / Consumer GPU |
| Gemma 4 E4B | ~5B | 4B effective | 128K | Text, Vision, Audio | Edge / Mobile |
| Gemma 4 E2B | 5.1B | 2.3B effective | 128K | Text, Vision, Audio | IoT / Raspberry Pi |
The naming takes a moment to parse. The "E" prefix means "effective parameters" — the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique called Per-Layer Embeddings. These tables are large on disk but cheap to compute, so the model runs like a 2B while technically weighing more.
The "A" in 26B A4B stands for "active parameters." Only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference. That means 26B-class intelligence at roughly 4B compute costs — a significant advantage for anyone paying per-token.
The MoE Architecture: 128 Small Experts
Where most large MoE models use a handful of big experts, Google went with 128 small ones, activating eight per token plus one shared always-on expert. The practical result is a model that benchmarks alongside dense 27B–31B models while running at 4B-class throughput during inference. Fewer GPUs, lower latency, cheaper per-token costs.
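The routing pattern described above can be sketched in a few lines. This is an illustrative toy, not Gemma 4's actual code: the hidden dimension, router, and expert weights here are made-up assumptions; only the shape of the computation — 128 small experts, top-8 per token, plus one always-on shared expert — follows the article.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D = 128, 8, 64  # 128 small experts, 8 active per token

# Router projects a token to one logit per expert; last expert slot is the
# shared always-on expert, which bypasses routing entirely.
router_w = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)
experts = rng.standard_normal((NUM_EXPERTS + 1, D, D)) / np.sqrt(D)

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts plus the shared expert."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                          # 8 best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalised softmax
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out + x @ experts[-1]                               # shared expert

x = rng.standard_normal(D)
y = moe_layer(x)
print(y.shape)  # (64,)
```

Only 9 of the 129 expert matrices multiply against any given token, which is where the "4B-class throughput from a 25B model" economics come from.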
Both workstation models use a hybrid attention mechanism — interleaving local sliding window attention with full global attention — which enables the 256K context window without blowing up memory. The final layer is always global, ensuring coherent long-range reasoning.
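The interleaving itself is just a layer schedule. Google hasn't published Gemma 4's exact ratio, so the 5-to-1 local-to-global pattern below is an assumption for illustration; the one detail taken from the article is that the final layer is forced to be global.

```python
def attention_schedule(num_layers: int, local_per_global: int = 5) -> list:
    """Interleave sliding-window ('local') layers with full ('global') ones.

    The 5:1 ratio is an assumed default, not Gemma 4's published config.
    """
    kinds = [
        "global" if (i + 1) % (local_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    kinds[-1] = "global"  # final layer is always global for long-range coherence
    return kinds

sched = attention_schedule(12)
print(sched)  # mostly 'local', with periodic 'global' layers; last is 'global'
```

Because the local layers only attend within a fixed window, their KV cache stays constant-size as context grows; only the sparse global layers pay the full 256K cost.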
Benchmarks That Would Have Been Frontier-Class Last Year
The numbers show a generational leap:
| Benchmark | Gemma 4 31B | Gemma 4 26B MoE | Gemma 3 27B | Note |
|---|---|---|---|---|
| AIME 2026 | 89.2% | 88.3% | 20.8%* | Math reasoning |
| LiveCodeBench v6 | 80.0% | 77.1% | 29.1%* | Coding |
| Codeforces ELO | 2,150 | — | — | Competitive programming |
| GPQA Diamond | — | 82.3% | — | Graduate-level science |
| MMMU Pro | 76.9% | — | — | Visual understanding |
| MATH-Vision | 85.6% | — | — | Visual math |
*Gemma 3 scores without thinking mode.
The gap between MoE and dense variants is modest given the massive inference cost difference. And the edge models punch well above their weight: E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench — numbers that would have been impressive for a full-size model not long ago.
Multimodal From the Ground Up
Previous open models typically treated vision and audio as bolt-ons. Gemma 4 integrates them at the architecture level. All four models handle variable aspect-ratio images with configurable visual token budgets (70 to 1,120 tokens per image), letting developers trade detail against compute depending on the task.
The two edge models add native audio processing — automatic speech recognition and speech translation (spoken input in, translated text out), all on-device. The audio encoder has been compressed to 305 million parameters (down from 681M in Gemma 3n), while the frame duration dropped from 160ms to 40ms for snappier transcription.
Function calling is also native across all four models, drawing on Google's FunctionGemma research. Unlike approaches that rely on instruction-following to coax structured tool use, Gemma 4's function calling was trained into the model from the start, optimized for multi-turn agentic flows with multiple tools.
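To make "structured tool use" concrete, here is a hypothetical sketch of the flow: a JSON tool schema, a machine-parseable call of the kind a natively tool-trained model emits, and dispatch to a local function. The schema format, field names, and `get_weather` function are all invented for illustration — this is not Gemma 4's actual wire format.

```python
import json

# Hypothetical tool schema, in the JSON-Schema style most tool-calling APIs use.
weather_tool = {
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A natively tool-trained model emits a structured call rather than prose
# that must be coaxed into JSON. Simulated model output:
model_output = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'
call = json.loads(model_output)

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real weather API

assert call["tool"] == weather_tool["name"]
result = get_weather(**call["arguments"])
print(result)  # Sunny in Zurich
```

In a multi-turn agentic loop, `result` would be fed back to the model as a tool response message, and the model would decide whether to call another tool or answer the user.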
It Actually Runs on a Raspberry Pi
This isn't marketing fluff. Google's LiteRT-LM framework reaches a prefill throughput of 133 tokens per second and a decode throughput of 7.6 tokens per second with the E2B on a Raspberry Pi 5. Thanks to 2-bit and 4-bit weight support, the E2B fits in under 1.5 GB of memory on some devices.
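The memory figure checks out on the back of an envelope: 5.1 billion weights at 2 bits each. Real runtimes add packing and KV-cache overhead, so treat this as an order-of-magnitude sanity check rather than a precise footprint.

```python
def weight_gib(params: float, bits: int) -> float:
    """Raw weight storage in GiB for a given parameter count and bit width."""
    return params * bits / 8 / 2**30

gb = weight_gib(5.1e9, 2)  # E2B total params at 2-bit precision
print(f"{gb:.2f} GiB")     # ~1.19 GiB, consistent with the <1.5 GB claim
```

At 4-bit the same model would need roughly 2.4 GiB for weights alone, which is why the tightest devices lean on the 2-bit path.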
The broader platform story matters for developers: Gemma 4 works across Android (via AICore), iOS, Windows, Linux, macOS (Metal), WebGPU in browsers, and Qualcomm IQ8 NPU platforms. Google also launched a new Python CLI tool for experimenting with Gemma on any machine — including tool calling support — without writing code.
For on-device AI, the Agent Skills feature in Google AI Edge Gallery demonstrates what Gemma 4 can do autonomously: querying knowledge bases, generating interactive visualizations, integrating with other models for music or image generation, and building multi-step workflows entirely through conversation.
What This Means
The real competitive angle isn't any single benchmark. It's the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license — all in one model family with deployment from IoT to cloud. That combination didn't exist in open-weight land before this week.
For enterprises evaluating open models, the Apache 2.0 license means the evaluation can start without a call to legal. For startups building agentic products, the MoE variant offers frontier-adjacent quality at dramatically lower serving costs. And for the local LLM community that just celebrated llama.cpp hitting 100K stars, Gemma 4 adds another powerful option that should appear in Ollama and LM Studio within days.
Google has hinted that this may not be the complete Gemma 4 family, with additional sizes likely to follow. But what's available today — combined with the TurboQuant compression that Google also recently released — paints a picture of a company that's serious about making its best research accessible to everyone, not just paying API customers.


