Gemma 4 Models

Four models, one family - from edge to frontier

The Gemma 4 family spans four models: the ultra-compact E2B and E4B for edge devices, the 26B MoE for efficient server deployment, and the 31B Dense flagship. All share native multimodal support, configurable thinking, and Apache 2.0 licensing.

All models

Choose the right Gemma 4 for your use case

Each model in the family is optimized for different deployment scenarios. Edge models include audio support, while server models offer 256K context and frontier-class reasoning.

Edge Models

E2B & E4B: On-device intelligence with audio

Ultra-compact models with 2.3B and 4.5B effective parameters. Both include native audio encoders, 128K context, and run on phones, browsers, and IoT devices.

Choose E2B for the smallest footprint (3.2GB at 4-bit). Choose E4B for better quality (5.5GB at 4-bit). Both support text, image, video, and audio input.

Server Models

26B MoE & 31B Dense: Frontier performance

The 26B MoE activates only about 3.8B of its 25.2B parameters per token for efficient serving. The 31B Dense is the flagship, ranked #3 on Arena AI. Both feature 256K context and native function calling.

Choose 26B for high-throughput production (16GB at 4-bit). Choose 31B for maximum quality (17GB at 4-bit). Both excel at reasoning, coding, and multimodal tasks.
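The 4-bit footprints above can be sanity-checked with simple arithmetic: at 4 bits, each parameter takes half a byte, so the weights alone need roughly half a gigabyte per billion parameters. The sketch below uses the total parameter counts quoted on this page (25.2B and 30.7B). The function name is hypothetical, and reading the remaining headroom as quantization metadata, activations, and KV cache is an assumption, not a published breakdown.

```python
def weight_only_gb(total_params_billions: float, bits_per_weight: int = 4) -> float:
    """Lower bound on memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return total_params_billions * bytes_per_weight


# Total parameter counts and quoted 4-bit footprints, both taken from this page.
models = {
    "Gemma 4 26B A4B": {"params_b": 25.2, "quoted_gb": 16.0},
    "Gemma 4 31B": {"params_b": 30.7, "quoted_gb": 17.0},
}

for name, m in models.items():
    floor = weight_only_gb(m["params_b"])
    headroom = m["quoted_gb"] - floor
    print(f"{name}: weights >= {floor:.2f} GB, "
          f"quoted {m['quoted_gb']:.0f} GB, headroom {headroom:.2f} GB")
```

Note that the MoE model has to keep all 25.2B parameters in memory even though only about 3.8B are used per token, which is why its footprint sits close to the 31B Dense.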

Edge - Ultra-compact

Gemma 4 E2B

2.3B effective parameters. The smallest Gemma 4 with full multimodal + audio support.

35 layers, PLE architecture, ~150M vision + ~300M audio encoder. 3.2GB VRAM at 4-bit.

Available now

Edge - Recommended

Gemma 4 E4B

4.5B effective parameters. Best edge model with strong reasoning and audio support.

42 layers, PLE architecture, ~150M vision + ~300M audio encoder. 5.5GB VRAM at 4-bit.

Available now

Server - Efficient

Gemma 4 26B A4B

25.2B total, 3.8B active per token. Near-31B quality at a fraction of the compute.

MoE with 128 experts (8 active + 1 shared). 256K context. 16GB VRAM at 4-bit.

Available now

Server - Flagship

Gemma 4 31B

30.7B dense parameters. #3 on Arena AI. Maximum intelligence and reliability.

Dense architecture, 256K context, 140+ languages. 17GB VRAM at 4-bit.

Available now

Shared capabilities

What every Gemma 4 model can do

All four models share a common set of capabilities, so you can move between tiers without changing how you build.

Native multimodal

All models process text and images natively. Edge models add audio and video support. The encoders are built in, so no external pipelines are needed.

Configurable thinking

All models support thinking modes for step-by-step reasoning. Control the depth of reasoning based on task complexity.

Function calling

Built-in function calling across the family enables agentic workflows. No fine-tuning required for tool use.
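To make the agentic workflow concrete, here is a minimal sketch of the application side of function calling: the model emits a structured tool call, and your code looks up and runs the matching function. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch` helper are all hypothetical; this page does not specify the exact format Gemma 4 emits, so check the model documentation before relying on it.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry of tools the application is willing to run.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for a tool call produced by the model (assumed format).
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch(model_output))
```

Keeping an explicit registry means the model can only trigger functions you have chosen to expose.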

Extended context

128K tokens for edge models, 256K for server models. Hybrid attention keeps memory usage practical.

140+ languages

Multilingual support with cultural context understanding across all model sizes.

Apache 2.0 license

Full commercial freedom. No MAU caps, no acceptable-use restrictions. Deploy anywhere, modify freely.

Quick selection guide

Which model should you choose?

Match your deployment constraints and quality requirements to the right Gemma 4 variant.

By hardware

  • Phone / IoT / 4GB RAM: Gemma 4 E2B
  • Laptop / 8-16GB RAM: Gemma 4 E4B
  • Single GPU / 16-24GB VRAM: Gemma 4 26B A4B
  • Multi-GPU / 24GB+ VRAM: Gemma 4 31B

By use case

  • Voice assistant / audio: E2B or E4B (audio support)
  • Browser-based AI: E2B or E4B (WebGPU)
  • High-throughput API: 26B A4B (MoE efficiency)
  • Maximum quality: 31B Dense (frontier performance)
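The two lists above can be folded into a small lookup. This is a sketch, not an official sizing tool: it assumes the lower bound of each hardware band (4, 8, 16, and 24 GB) is the minimum memory for that variant, and the `Variant` and `pick_variant` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    min_memory_gb: float  # assumed: lower bound of the hardware band on this page
    audio: bool           # native audio input, per this page


# Ordered from the highest quality tier to the smallest footprint.
VARIANTS = [
    Variant("Gemma 4 31B", 24.0, False),
    Variant("Gemma 4 26B A4B", 16.0, False),
    Variant("Gemma 4 E4B", 8.0, True),
    Variant("Gemma 4 E2B", 4.0, True),
]


def pick_variant(memory_gb: float, needs_audio: bool = False) -> Optional[Variant]:
    """Return the highest-tier variant that fits in memory and meets the audio need."""
    for variant in VARIANTS:
        if memory_gb >= variant.min_memory_gb and (variant.audio or not needs_audio):
            return variant
    return None  # nothing fits below the smallest band


for memory, audio in [(4, False), (12, False), (16, False), (48, False), (48, True)]:
    choice = pick_variant(memory, needs_audio=audio)
    print(f"{memory} GB, audio={audio}: {choice.name if choice else 'no fit'}")
```

Because only the edge models accept audio, a voice workload lands on E4B even when far more memory is available.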

Performance

Complete benchmark comparison across all four models

Every Gemma 4 model forms part of a Pareto frontier - each size delivers exceptional performance relative to its parameter count.

From the ultra-compact E2B to the flagship 31B, each model is optimized for its deployment tier while sharing the same architectural innovations.

Gemma 4 family performance comparison across all model sizes

31B Dense: #3 on Arena AI (ELO 1452), 89.2% AIME 2026, 80% LiveCodeBench v6

26B MoE: Near-31B quality (ELO 1441) with only 3.8B active parameters per token

E4B: 69.4% MMLU Pro, 52% LiveCodeBench v6 - strong edge performance with audio

E2B: 60% MMLU Pro, 44% LiveCodeBench v6 - meaningful AI at 3.2GB VRAM

Full family comparison

All Gemma 4 models side by side

Complete benchmark results across reasoning, coding, multimodal, and deployment metrics.

Benchmark                           31B Dense (Flagship)  26B A4B (MoE)  E4B (Edge)  E2B (Compact)
Arena AI ELO (overall ranking)      1452                  1441           -           -
MMLU Pro (knowledge & reasoning)    85.2%                 82.6%          69.4%       60.0%
AIME 2026 (mathematics)             89.2%                 88.3%          42.5%       37.5%
LiveCodeBench v6 (coding)           80.0%                 77.1%          52.0%       44.0%
GPQA Diamond (science)              84.3%                 82.3%          58.6%       43.4%
MMMU Pro (multimodal)               76.9%                 73.8%          52.6%       44.2%
Context window (max tokens)         256K                  256K           128K        128K
Audio support (native audio)        No                    No             Yes         Yes
VRAM at 4-bit (minimum memory)      ~17 GB                ~16 GB         ~5.5 GB     ~3.2 GB

All figures are from the official Gemma 4 model card. Arena AI scores are as of April 2, 2026.
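One way to read the comparison is to measure how far the 26B A4B trails the 31B Dense on each benchmark. The sketch below does that arithmetic with the percentages reported on this page; the variable and function names are hypothetical, and the gaps are expressed in percentage points.

```python
# (31B Dense, 26B A4B) scores in percent, copied from the comparison on this page.
SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "GPQA Diamond": (84.3, 82.3),
    "MMMU Pro": (76.9, 73.8),
}


def gaps(scores):
    """Percentage-point gap (dense minus MoE) per benchmark, rounded to one decimal."""
    return {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}


gap_by_benchmark = gaps(SCORES)
largest = max(gap_by_benchmark, key=gap_by_benchmark.get)
mean_gap = sum(gap_by_benchmark.values()) / len(gap_by_benchmark)

for name, gap in gap_by_benchmark.items():
    print(f"{name}: {gap:.1f} points")
print(f"largest gap: {largest}; mean gap: {mean_gap:.1f} points")
```

On these five benchmarks the MoE stays within roughly three points of the flagship, which is the basis for the "near-31B quality" description.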

Edge Tier

E2B & E4B: AI that runs on your device

The edge models bring full multimodal AI to phones, browsers, and IoT devices. Both include native audio encoders - a capability the larger models don't have. Choose E2B for the smallest footprint, E4B for better quality.

  • E2B: 2.3B effective, 3.2GB at 4-bit, 95 tok/s on consumer hardware
  • E4B: 4.5B effective, 5.5GB at 4-bit, strong reasoning and coding
  • Both: native audio, 128K context, WebGPU browser support

Server Tier

26B MoE & 31B Dense: Frontier performance

The server models deliver frontier-class reasoning, coding, and multimodal understanding. The 26B MoE offers near-31B quality at a fraction of the compute. The 31B Dense is the flagship for maximum performance.

  • 26B MoE: 3.8B active per token, ELO 1441, 88.3% AIME 2026
  • 31B Dense: Full 30.7B active, ELO 1452, 89.2% AIME 2026
  • Both: 256K context, native function calling, 140+ languages
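For readers new to mixture-of-experts, the sketch below shows generic top-k routing with an always-on shared expert, using the expert counts from this page (128 experts, 8 routed per token, 1 shared). It is an illustration of the general technique, not Gemma 4's actual router: the softmax over the selected logits, the toy scalar experts, and every function name are assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # from this page
TOP_K = 8          # routed experts active per token, from this page


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts; weights sum to 1."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, shared_expert, router_logits):
    """Add the weighted outputs of the routed experts to the shared expert's output."""
    out = shared_expert(x)
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy scalar "experts": expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: x

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

print(len(route(logits)), "of", NUM_EXPERTS, "routed experts run for this token")
print("layer output:", moe_layer(1.0, experts, shared_expert, logits))
print(f"active share of parameters: {3.8 / 25.2:.1%}")
```

Only the selected experts run for a given token, which is how a 25.2B-parameter model can do roughly the work of a 3.8B one per token.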

Architecture

Shared innovations across the family

All Gemma 4 models share key architectural innovations from Google DeepMind's research. Per-Layer Embeddings, shared KV cache, and hybrid attention patterns maximize efficiency at every scale.

  • Per-Layer Embeddings (PLE) for parameter-efficient conditioning
  • Shared KV cache reduces memory during long-context generation
  • Hybrid local/global attention for optimal memory-quality tradeoff
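To see why hybrid local/global attention matters at long context, the sketch below estimates KV-cache size when only some layers attend over the full context and the rest keep a fixed sliding window. Every configuration value in it (48 layers, one global layer in six, a 1,024-token window, 8 KV heads, head dimension 128, 2-byte values, 256K taken as 262,144 tokens) is a hypothetical placeholder, not a published Gemma 4 setting, and the shared KV cache is not modeled.

```python
def kv_cache_bytes(context_tokens, num_layers, global_every, window_tokens,
                   kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size when one layer in `global_every` attends globally."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # keys + values
    global_layers = num_layers // global_every
    local_layers = num_layers - global_layers
    global_bytes = global_layers * context_tokens * per_token_per_layer
    # Local layers never cache more than their sliding window.
    local_bytes = local_layers * min(context_tokens, window_tokens) * per_token_per_layer
    return global_bytes + local_bytes


# Hypothetical configuration, for illustration only.
config = dict(num_layers=48, window_tokens=1024, kv_heads=8, head_dim=128)
context = 256 * 1024

hybrid = kv_cache_bytes(context, global_every=6, **config)
all_global = kv_cache_bytes(context, global_every=1, **config)

print(f"hybrid:     {hybrid / 2**30:.2f} GiB")
print(f"all-global: {all_global / 2**30:.2f} GiB")
print(f"hybrid uses {hybrid / all_global:.0%} of the all-global cache")
```

Under these assumed numbers the local layers contribute almost nothing at long context, so the cache is dominated by the few global layers.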

Gemma 4 Family

Explore each model in detail

Dive deeper into each Gemma 4 variant with dedicated pages covering architecture, benchmarks, and deployment guides.

Gemma 4 E2B

Ultra-compact 2.3B edge model with audio

Explore

Gemma 4 E4B

Recommended 4.5B edge model with audio

Explore

Gemma 4 26B

Efficient MoE with 4B active parameters

Explore

Gemma 4 31B

Flagship dense model, #3 on Arena AI

Explore

Run Locally

Guide to running Gemma 4 on your hardware

Read guide

API Access

Use Gemma 4 through hosted APIs

Get started

Get started

Find your Gemma 4 model

Start chatting with any Gemma 4 model for free, or download weights for local deployment. Apache 2.0 licensed for full commercial freedom.