Gemma 4 Models
Four models, one family - from edge to frontier
The Gemma 4 family spans four architectures: ultra-compact E2B and E4B for edge devices, the 26B MoE for efficient server deployment, and the 31B Dense flagship. All share native multimodal support, configurable thinking, and Apache 2.0 licensing.
All models
Choose the right Gemma 4 for your use case
Each model in the family is optimized for different deployment scenarios. Edge models include audio support, while server models offer 256K context and frontier-class reasoning.
Edge Models
E2B & E4B: On-device intelligence with audio
Ultra-compact models with 2.3B and 4.5B effective parameters. Both include native audio encoders, 128K context, and run on phones, browsers, and IoT devices.
Choose E2B for the smallest footprint (3.2GB at 4-bit). Choose E4B for better quality (5.5GB at 4-bit). Both support text, image, video, and audio input.
Server Models
26B MoE & 31B Dense: Frontier performance
The 26B MoE activates only about 3.8B of its 25.2B parameters per token for efficient serving. The 31B Dense is the flagship, ranked #3 on Arena AI. Both feature 256K context and native function calling.
Choose 26B for high-throughput production (16GB at 4-bit). Choose 31B for maximum quality (17GB at 4-bit). Both excel at reasoning, coding, and multimodal tasks.
Edge - Ultra-compact
Gemma 4 E2B
2.3B effective parameters. The smallest Gemma 4 with full multimodal + audio support.
35 layers, PLE architecture, ~150M vision + ~300M audio encoder. 3.2GB VRAM at 4-bit.
Edge - Recommended
Gemma 4 E4B
4.5B effective parameters. Best edge model with strong reasoning and audio support.
42 layers, PLE architecture, ~150M vision + ~300M audio encoder. 5.5GB VRAM at 4-bit.
Server - Efficient
Gemma 4 26B A4B
25.2B total, 3.8B active per token. Near-31B quality at a fraction of the compute.
MoE with 128 experts (8 active + 1 shared). 256K context. 16GB VRAM at 4-bit.
Server - Flagship
Gemma 4 31B
30.7B dense parameters. #3 on Arena AI. Maximum intelligence and reliability.
Dense architecture, 256K context, 140+ languages. 17GB VRAM at 4-bit.
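As a rough sanity check on the 4-bit figures quoted for the server models, the weights-only footprint can be estimated from the parameter counts. This is a minimal sketch that assumes exactly 4 bits per weight and 1 GB = 1e9 bytes. The page does not say what the quoted VRAM figures include, so treating the remainder as room for the KV cache, activations, and quantization metadata is an assumption.

```python
# Figures copied from this page: total parameters (billions) and quoted 4-bit VRAM (GB).
SERVER_MODELS = {
    "26B A4B": {"params_b": 25.2, "quoted_gb": 16.0},
    "31B": {"params_b": 30.7, "quoted_gb": 17.0},
}


def weights_only_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Lower-bound memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for name, spec in SERVER_MODELS.items():
    weights = weights_only_gb(spec["params_b"])
    # Assumption: whatever is left covers KV cache, activations, and overhead.
    remainder = spec["quoted_gb"] - weights
    print(f"{name}: weights ~{weights:.2f} GB, quoted {spec['quoted_gb']:.0f} GB, "
          f"remainder ~{remainder:.2f} GB")
```

Note that the MoE model still needs all 25.2B parameters resident in memory even though only about 3.8B are active per token, which is why its footprint is close to the dense model's.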
Shared capabilities
What every Gemma 4 model can do
All four models share the capabilities below, regardless of size or deployment tier.
Native multimodal
All models process text and images natively. Edge models add audio and video support. The encoders are built into each model, so no separate pipeline is needed.
Configurable thinking
All models support thinking modes for step-by-step reasoning. Control the depth of reasoning based on task complexity.
Function calling
Built-in function calling across the family enables agentic workflows. No fine-tuning required for tool use.
Extended context
128K tokens for edge models, 256K for server models. Hybrid attention keeps memory usage practical.
140+ languages
Multilingual support with cultural context understanding across all model sizes.
Apache 2.0 license
Full commercial freedom. No MAU caps, no acceptable-use restrictions. Deploy anywhere, modify freely.
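To make the function-calling capability above concrete, here is a minimal sketch of the application side of an agentic loop: the model emits a structured tool call, and the host program runs the matching function. The JSON shape (`name`, `arguments`), the `get_weather` tool, and the `dispatch` helper are all hypothetical and are not taken from the Gemma 4 documentation; check the official model card for the actual tool-call format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Registry of tools the application is willing to run.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a JSON tool call and run the named tool with its arguments."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for model output; the JSON shape is an assumption, not Gemma 4's format.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch(model_output))
```

Keeping an explicit registry means the model can only trigger functions the application has chosen to expose.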
Quick selection guide
Which model should you choose?
Match your deployment constraints and quality requirements to the right Gemma 4 variant.
By hardware
- Phone / IoT / 4GB RAM: Gemma 4 E2B
- Laptop / 8-16GB RAM: Gemma 4 E4B
- Single GPU / 16-24GB VRAM: Gemma 4 26B A4B
- Multi-GPU / 24GB+ VRAM: Gemma 4 31B
By use case
- Voice assistant / audio: E2B or E4B (audio support)
- Browser-based AI: E2B or E4B (WebGPU)
- High-throughput API: 26B A4B (MoE efficiency)
- Maximum quality: 31B Dense (frontier performance)
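The selection guide can also be expressed as a small lookup. The sketch below uses only the 4-bit memory figures and audio support quoted on this page; the `pick_model` helper is illustrative, and it leaves no headroom for long contexts, so it is more aggressive than the hardware list above.

```python
from typing import Optional

# (model name, approximate 4-bit memory in GB, native audio), smallest first.
MODELS = [
    ("Gemma 4 E2B", 3.2, True),
    ("Gemma 4 E4B", 5.5, True),
    ("Gemma 4 26B A4B", 16.0, False),
    ("Gemma 4 31B", 17.0, False),
]


def pick_model(memory_gb: float, needs_audio: bool = False) -> Optional[str]:
    """Return the largest listed model that fits, or None if nothing fits."""
    candidates = [
        name
        for name, need_gb, has_audio in MODELS
        if need_gb <= memory_gb and (has_audio or not needs_audio)
    ]
    return candidates[-1] if candidates else None


print(pick_model(4))                     # phone-class memory budget
print(pick_model(24))                    # single 24GB GPU
print(pick_model(24, needs_audio=True))  # audio restricts the choice to edge models
```

In practice, subtract some memory for the KV cache and runtime before calling a helper like this, especially when using the full 128K or 256K context.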
Performance
Complete benchmark comparison across all four models
Every Gemma 4 model sits on a Pareto frontier of size versus quality: each size delivers strong performance relative to its parameter count.
From the ultra-compact E2B to the flagship 31B, each model is optimized for its deployment tier while sharing the same architectural innovations.


- 31B Dense: #3 on Arena AI (ELO 1452), 89.2% AIME 2026, 80% LiveCodeBench v6
- 26B MoE: Near-31B quality (ELO 1441) with only 3.8B active parameters per token
- E4B: 69.4% MMLU Pro, 52% LiveCodeBench v6 - strong edge performance with audio
- E2B: 60% MMLU Pro, 44% LiveCodeBench v6 - meaningful AI at 3.2GB VRAM
Full family comparison
All Gemma 4 models side by side
Complete benchmark results across reasoning, coding, multimodal, and deployment metrics.
| Benchmark | 31B Dense | 26B A4B MoE | E4B | E2B |
|---|---|---|---|---|
| Arena AI ELO (overall ranking) | 1452 | 1441 | - | - |
| MMLU Pro (knowledge & reasoning) | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (mathematics) | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 (coding) | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond (science) | 84.3% | 82.3% | 58.6% | 43.4% |
| MMMU Pro (multimodal) | 76.9% | 73.8% | 52.6% | 44.2% |
| Context window (max tokens) | 256K | 256K | 128K | 128K |
| Native audio support | No | No | Yes | Yes |
| VRAM at 4-bit (minimum memory) | ~17 GB | ~16 GB | ~5.5 GB | ~3.2 GB |
All figures are from the official Gemma 4 model card. Arena AI scores are as of April 2, 2026.
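To put "near-31B quality" in numbers, the sketch below computes the gap between the two server models on each benchmark in the table, and converts the 11-point rating difference into an expected head-to-head score. It assumes the Arena AI score behaves like a standard Elo rating (400-point logistic scale); the page labels it ELO but does not describe the rating method.

```python
# Scores copied from the table above, in percent: (31B Dense, 26B A4B MoE).
SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "GPQA Diamond": (84.3, 82.3),
    "MMMU Pro": (76.9, 73.8),
}


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (assumed to apply here)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


for bench, (dense, moe) in SCORES.items():
    print(f"{bench}: 31B leads by {dense - moe:.1f} points")

win_rate = elo_expected_score(1452, 1441)
print(f"Expected 31B score vs 26B A4B: {win_rate:.3f}")
```

Under that assumption, an 11-point gap corresponds to an expected score only slightly above 0.5, which is consistent with the benchmark gaps of roughly one to three points.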
Edge Tier
E2B & E4B: AI that runs on your device
The edge models bring full multimodal AI to phones, browsers, and IoT devices. Both include native audio encoders - a capability the larger models don't have. Choose E2B for the smallest footprint, E4B for better quality.
- E2B: 2.3B effective, 3.2GB at 4-bit, 95 tok/s on consumer hardware
- E4B: 4.5B effective, 5.5GB at 4-bit, strong reasoning and coding
- Both: native audio, 128K context, WebGPU browser support
Server Tier
26B MoE & 31B Dense: Frontier performance
The server models deliver frontier-class reasoning, coding, and multimodal understanding. The 26B MoE offers near-31B quality at a fraction of the compute. The 31B Dense is the flagship for maximum performance.
- 26B MoE: 3.8B active per token, ELO 1441, 88.3% AIME 2026
- 31B Dense: Full 30.7B active, ELO 1452, 89.2% AIME 2026
- Both: 256K context, native function calling, 140+ languages
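The "3.8B active per token" figure follows from sparse routing: of the 128 routed experts, only 8 run for each token, plus 1 shared expert that always runs. The sketch below shows one common routing scheme, softmax over the top-k router logits. The page does not describe Gemma 4's actual router, so the `route_token` function and the random logits are illustrative assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts, from the spec on this page
TOP_K = 8          # routed experts active per token


def route_token(router_logits: list, top_k: int = TOP_K) -> dict:
    """Pick the top_k experts and return their renormalized softmax weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
weights = route_token(logits)

# The shared expert runs for every token, so 8 routed + 1 shared execute.
experts_run = len(weights) + 1
active_fraction = 3.8 / 25.2  # active vs. total parameters quoted on this page
print(f"experts run per token: {experts_run} of {NUM_EXPERTS + 1}")
print(f"active parameter fraction: {active_fraction:.1%}")
```

Only the selected experts' feed-forward weights are used for a given token, which is where the compute savings come from; all experts still have to be held in memory.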
Architecture
Shared innovations across the family
All Gemma 4 models share key architectural innovations from Google DeepMind's research. Per-Layer Embeddings, shared KV cache, and hybrid attention patterns maximize efficiency at every scale.
- Per-Layer Embeddings (PLE) for parameter-efficient conditioning
- Shared KV cache reduces memory during long-context generation
- Hybrid local/global attention for optimal memory-quality tradeoff
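The memory benefit of hybrid local/global attention can be illustrated with attention masks. A local (sliding-window) layer only needs to keep the most recent keys in its cache, while a global layer keeps all of them. This is a minimal sketch: the window size and the 5-local-to-1-global layer pattern are hypothetical values chosen for illustration, not figures from this page.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: no attending to future positions
            if window is not None:
                visible = visible and (q - k) < window  # local: recent keys only
            row.append(visible)
        mask.append(row)
    return mask


def keys_cached(seq_len: int, window: Optional[int] = None) -> int:
    """Keys one layer must keep cached once seq_len tokens have been processed."""
    return seq_len if window is None else min(seq_len, window)


seq_len, window = 8, 3                      # toy sizes for readability
LAYER_PATTERN = ["local"] * 5 + ["global"]  # hypothetical interleaving

local_mask = causal_mask(seq_len, window)
global_mask = causal_mask(seq_len)
hybrid_keys = sum(
    keys_cached(seq_len, window if kind == "local" else None)
    for kind in LAYER_PATTERN
)
all_global_keys = keys_cached(seq_len) * len(LAYER_PATTERN)
print(f"cached keys, hybrid: {hybrid_keys}, all-global: {all_global_keys}")
```

The gap widens with sequence length, because the local layers' cache stays fixed at the window size while the global layers' cache grows with the context.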
