Gemma 4 E2B: Google's Pocket-Sized On-Device LLM

1.5 GB quantized footprint, multimodal text+vision+audio, Apache 2.0 license — Gemma 4 E2B is one of 2026's most deployable on-device large language models.

Last reviewed: May 2026
Parameters: 2.3 B
Size (quantized): 1.5 GB
Context length: 128,000 tokens
Modality: text + vision + audio
License: Apache 2.0
Min RAM: 4 GB
Version: Gemma 4 E2B-it
Released: 2026-04

What is it?

Gemma 4 E2B is Google DeepMind’s mobile-first member of the Gemma 4 family, released in April 2026. With 2.3 billion effective parameters (using the Per-Layer Embedding architecture) and a 1.5 GB quantized footprint, it’s purpose-built to run entirely on consumer phones — no cloud calls, no streaming, no privacy compromises. Cove uses Gemma 4 across all four apps (Travel, Voice, Photo, Health), making it the model with the most real-world consumer deployments at the time of writing.

Note on parameter counts: official labels say “E2B = 2.3B effective parameters”, referring to weights active in each forward pass. The Per-Layer Embedding (PLE) lookup tables bring the total weight count to ~5.1B, but those tables are accessed selectively rather than computed through. The 1.5 GB quantized footprint is what hits your phone’s storage.

Core specs at a glance

(See the spec card above.)

What devices can run it?

Gemma 4 E2B runs comfortably on flagship Android phones (Pixel 8 and newer, Galaxy S24+, OnePlus 12+) and on the iPhone 15 Pro / Pro Max / 16 family. It will technically install on devices with 6 GB RAM (the spec card lists 4 GB as the minimum), but token throughput drops sharply below 8 GB. iPad M-series and recent MacBook Air / Pro models are also supported, where it benefits from the higher memory bandwidth.
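An app that ships the model might gate it on device RAM before downloading anything. The sketch below is only an illustration of that idea: the function name `ram_tier` and the tier labels are hypothetical, and the default thresholds are simply the 4 GB minimum from the spec card and the 8 GB figure quoted above, not values from any official SDK.

```python
def ram_tier(ram_gb: float,
             min_ram_gb: float = 4.0,
             comfortable_ram_gb: float = 8.0) -> str:
    """Classify a device by total RAM.

    Thresholds are assumptions taken from this article:
    4 GB is the listed minimum, and throughput is said to
    drop sharply below 8 GB.
    """
    if ram_gb < min_ram_gb:
        return "unsupported"
    if ram_gb < comfortable_ram_gb:
        return "degraded"      # runs, but expect low tokens/sec
    return "comfortable"


# Hypothetical devices, identified only by RAM size.
for ram in (3, 6, 8, 12):
    print(f"{ram} GB -> {ram_tier(ram)}")
```

Keeping the thresholds as parameters makes it easy to adjust them once you have measured throughput on your own target devices.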

Strengths and limitations

Strengths. Best-in-class size-to-quality ratio for general text tasks, native multimodal support (text + vision + audio), Apache 2.0 license, and Google’s active maintenance with quarterly updates. Distillation from larger Gemini family models gives it broader knowledge than its parameter count suggests.

Limitations. It scores below Phi-4-multimodal on math and reasoning benchmarks. The 128K context is now on par with Llama 3.2, so context length is no longer a bottleneck for long documents. Multilingual quality is uneven, however: strong on the top 20 languages, weaker on under-represented ones.

When to choose it (and when not to)

Choose Gemma 4 E2B if: you need a balanced general-purpose on-device model, you want text + vision + audio in one runtime, you’re shipping to phones with 4+ GB RAM as a baseline, and license simplicity matters.

Skip it if: your workload is reasoning-heavy (use Phi-4-multimodal or DeepSeek-R1 Distill), you need million-token context (still cloud-only territory), or you’re targeting Apple-only and want first-party tools (use Apple Foundation Models).

How it compares to similar on-device models

The two closest competitors are Microsoft Phi-4-multimodal (larger, sharper reasoning, MIT license, also text+vision+audio) and Qwen 3.5 2B (stronger Chinese, comparable size, 262K context). For a full side-by-side comparison, see the leaderboard.

In a real Cove app

Cove Travel uses Gemma 4 for camera-based menu translation and offline voice translation; Cove Voice uses it for AI-summarized voice notes. Both apps demonstrate that Gemma 4 E2B is production-ready for consumer use cases, not just a research demo.


FAQ

Can Gemma 4 E2B run on iPhone?

Yes. Gemma 4 E2B runs on iPhone 15 Pro or newer, taking advantage of the Apple Neural Engine and 8 GB unified memory. Older models like iPhone 14 lack sufficient RAM headroom for 4-bit quantized 2 B-parameter inference.

What is the actual download size?

About 1.5 GB after 4-bit quantization, helped by the Per-Layer Embedding (PLE) architecture that Gemma 4 uses. The unquantized weights are around 4 GB, so Cove and similar apps ship the quantized version to keep storage costs reasonable for end users.
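As a rough sanity check on these sizes, the raw weight payload can be estimated as parameters × bits per weight ÷ 8. The sketch below is a back-of-envelope calculation, not a description of how the published files are actually packed. It counts only the 2.3 B effective parameters and assumes 16-bit storage for the unquantized weights, which this article does not state; the helper name `weight_size_gb` is hypothetical.

```python
def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


EFFECTIVE_PARAMS = 2.3e9  # "2.3 B effective parameters" from the spec card

four_bit = weight_size_gb(EFFECTIVE_PARAMS, 4)      # 4-bit quantized
sixteen_bit = weight_size_gb(EFFECTIVE_PARAMS, 16)  # assumed 16-bit baseline

print(f"4-bit:  {four_bit:.2f} GB")    # 4-bit:  1.15 GB
print(f"16-bit: {sixteen_bit:.2f} GB") # 16-bit: 4.60 GB
```

These estimates fall in the same range as the quoted 1.5 GB and roughly 4 GB. Shipped files can differ from the raw arithmetic for reasons such as file metadata or layers kept at a different precision.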

Is Gemma 4 E2B open source?

Yes. Gemma 4 is released under Apache 2.0 — Google moved away from the older Gemma terms of use starting with this generation. The weights are open and permitted for commercial use with the standard Apache attribution requirements.

How fast is inference on a phone?

Roughly 20-40 tokens per second on flagship phones (Pixel 8 Pro, iPhone 15 Pro, Galaxy S24+). Older mid-range phones drop to 5-10 tok/s. Time-to-first-token is 200-500 ms, depending on prompt length.
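To turn these figures into a user-facing wait time, a simple model is time-to-first-token plus output length divided by decode speed. The sketch below applies that to the flagship numbers quoted above. The 200-token reply length and the function name `response_time_s` are illustrative assumptions, and real latency also depends on factors this model ignores, such as thermal throttling.

```python
def response_time_s(output_tokens: int,
                    tokens_per_s: float,
                    ttft_s: float) -> float:
    """Estimated seconds until the full reply has been generated."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


REPLY_TOKENS = 200  # assumed reply length, chosen only for illustration

# Flagship range from the article: 20-40 tok/s, 200-500 ms to first token.
best = response_time_s(REPLY_TOKENS, tokens_per_s=40, ttft_s=0.2)
worst = response_time_s(REPLY_TOKENS, tokens_per_s=20, ttft_s=0.5)

print(f"best case:  {best:.1f} s")   # best case:  5.2 s
print(f"worst case: {worst:.1f} s")  # worst case: 10.5 s
```

Because tokens can be streamed to the screen as they are generated, the perceived wait is closer to the time-to-first-token than to the full figure.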

How does Gemma 4 E2B compare to Phi-4-multimodal?

Gemma 4 E2B is much smaller (2.3 B effective vs 5.6 B parameters) and faster on the same hardware, while Phi-4-multimodal is stronger at reasoning. Both now support text+vision+audio, so the choice usually comes down to your RAM budget. See our Phi-4 comparison for a full breakdown.

Citations