Overview
Released April 2, 2026 under Apache 2.0, Gemma 4 delivers frontier-class multimodal intelligence across four architectures. From ultra-mobile 2B edge models to the flagship 31B dense variant, every size processes text, variable-resolution images, and video natively; the E2B and E4B edge models also accept audio input.
Edge Models
Ultra-compact models with 2.3B and 4.5B effective parameters, built for Pixel, Chrome, and browser deployment with native audio support and 128K context.
The E2B and E4B variants use Per-Layer Embeddings (PLE) to maximize parameter efficiency. They support text, image, video, and audio inputs natively, making them ideal for privacy-focused on-device applications.
Server Models
The 31B dense model ranks #3 among open models on the Arena AI leaderboard and scores 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while maintaining similar quality.
Both models feature 256K context windows, native function calling, and configurable thinking modes. The 31B achieves 85.2% on MMLU Pro and 80% on LiveCodeBench v6, competing with models many times its size.
Capabilities
All models process text, images with variable aspect ratios, and video natively. Only E2B and E4B include audio encoders for speech understanding.
The vision encoder uses learned 2D positions and multidimensional RoPE, preserving original aspect ratios. Images can be encoded to different token budgets (70, 140, 280, 560, 1120) for optimal speed-quality tradeoffs.
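One way to reason about the speed-quality tradeoff is to pick the largest listed budget that fits a per-image cap. The sketch below uses only the five budgets named above; `pick_image_budget` and `images_that_fit` are hypothetical helpers for illustration, not part of any Gemma 4 API.

```python
# Token budgets listed in the text above.
IMAGE_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)


def pick_image_budget(max_tokens_per_image: int) -> int:
    """Return the largest listed budget that does not exceed the cap.

    Hypothetical helper, not part of any Gemma 4 API.
    """
    fitting = [b for b in IMAGE_TOKEN_BUDGETS if b <= max_tokens_per_image]
    if not fitting:
        raise ValueError("cap is below the smallest budget (70 tokens)")
    return max(fitting)


def images_that_fit(context_tokens: int, budget: int) -> int:
    """Count how many images at a given budget fit in a token allowance."""
    if budget not in IMAGE_TOKEN_BUDGETS:
        raise ValueError(f"unsupported budget: {budget}")
    return context_tokens // budget
```

A lower budget lets more images share the same context allowance, at the cost of detail per image.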
Architecture
Small models feature 128K context, while medium models support 256K. Dual RoPE configurations enable longer context processing.
Alternating local sliding-window (512-1024 tokens) and global full-context attention layers optimize memory usage. Shared KV cache reduces compute and memory for long-context generation.
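The memory saving from mixing local and global layers can be illustrated with a toy mask builder. This is a minimal sketch: the interleaving ratio (`global_every`) is an assumption, since the text above does not state how local and global layers are ordered, and all function names are hypothetical.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention. An integer window limits
    each query to itself and the previous window - 1 tokens.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_kinds(num_layers: int, global_every: int) -> List[str]:
    """Assumed interleaving: every `global_every`-th layer is global."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


def cached_positions(seq_len: int, window: Optional[int]) -> int:
    """Number of past positions a layer must keep in its KV cache."""
    return seq_len if window is None else min(seq_len, window)
```

With a 1024-token window, a local layer keeps at most 1024 cached positions no matter how long the sequence grows, while a global layer keeps all of them.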
Features
All models support configurable thinking modes for advanced reasoning tasks, with native system prompt support for structured conversations.
The 31B model achieves 89.2% on AIME 2026 math reasoning and 84.3% on GPQA Diamond. Built-in function calling powers autonomous agents without fine-tuning.
Performance
The 31B model scores 80% on LiveCodeBench v6 and reaches 2150 Codeforces ELO. The 26B MoE achieves 77.1% on LiveCodeBench v6 with only 4B active parameters.
Notable improvements in coding benchmarks alongside built-in function calling support enable highly capable autonomous agents. HLE benchmark shows 19.5% without tools, 26.5% with search.
Multimodal
The 31B model achieves 76.9% on MMMU Pro and 85.6% on MATH-Vision. OmniDocBench edit distance of 0.131 demonstrates strong OCR capabilities.
Variable aspect ratio support and configurable image token budgets enable efficient processing of documents, diagrams, and screenshots. The E4B model reaches 52.6% on MMMU Pro despite its compact size.
Integration
Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. ONNX checkpoints enable edge device deployment.
Apache 2.0 license permits responsible commercial use. Available on Kaggle, Hugging Face, and through Google AI Studio. Compatible with local tools like Ollama for private, offline interactions.
Introduction
Learn about the four model architectures, native multimodal capabilities, and deployment options from Google DeepMind.
Performance
Gemma 4 models form a Pareto frontier, delivering exceptional performance relative to their size. The 31B dense model ranks #3 among all open models on Arena AI leaderboard.
Official benchmarks demonstrate competitive performance with models many times larger. The 31B model achieves 89.2% on AIME 2026 math reasoning, while the 26B MoE reaches similar quality with only 4B active parameters.


The 31B model achieves 89.2% on AIME 2026 and 85.2% on MMLU Pro, competing with models over 100B parameters.
Coding performance reaches 80% on LiveCodeBench v6 and 2150 Codeforces ELO, ahead of many larger models.
Vision capabilities include 76.9% on MMMU Pro and 85.6% on MATH-Vision, with strong OCR and document understanding.
Official Benchmarks
Comprehensive evaluation across reasoning, coding, vision, audio, and long-context tasks demonstrates frontier-class capabilities.
| Benchmark | Gemma 4 31B (dense flagship) | Gemma 4 26B A4B (MoE, 4B active) | Gemma 4 E4B (edge) | Gemma 4 E2B (ultra-compact) |
|---|---|---|---|---|
| MMLU Pro (knowledge & reasoning) | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026, no tools (math reasoning) | 89.2% | 88.3% | 42.5% | 37.5% |
| GPQA Diamond (graduate-level science) | 84.3% | 82.3% | 58.6% | 43.4% |
| LiveCodeBench v6 (coding) | 80.0% | 77.1% | 52.0% | 44.0% |
| Codeforces ELO (competitive programming) | 2150 | 1718 | 940 | 633 |
| MMMU Pro (multimodal understanding) | 76.9% | 73.8% | 52.6% | 44.2% |
| MATH-Vision (visual math reasoning) | 85.6% | 82.4% | 59.5% | 52.4% |
| OmniDocBench 1.5 (document OCR, edit distance; lower is better) | 0.131 | 0.149 | 0.181 | 0.290 |
| Context window (maximum tokens) | 256K | 256K | 128K | 128K |
| Native audio input | No | No | Yes | Yes |
All figures are from the official Gemma 4 model card and the Hugging Face blog. The E2B and E4B results demonstrate exceptional efficiency for their parameter count.
Server Models
The 31B dense model ranks #3 among open models on the Arena AI leaderboard and scores 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while maintaining similar quality, making it ideal for high-throughput scenarios.
Edge Models
Ultra-compact models with 2.3B and 4.5B effective parameters, designed for Pixel, Chrome, and browser deployment. Native audio encoders enable real-time speech understanding on-device.
Architecture
Gemma 4 introduces architectural innovations that maximize efficiency. PLE gives each decoder layer its own conditioning pathway, while shared KV cache reduces memory usage during long-context generation.

Multimodal
All models process text and images with variable aspect ratios natively. The vision encoder uses learned 2D positions and can encode images to different token budgets (70 to 1120 tokens) for speed-quality tradeoffs.

Deployment
Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. E2B and E4B run in browsers with transformers.js, while 31B and 26B excel on server hardware.

FAQ
Understanding Gemma 4's technical innovations, from Per-Layer Embeddings to multimodal processing.
Gemma 4 introduces native multimodal support (text, image, video, audio), extended context windows (128K-256K), configurable thinking modes, and built-in function calling. The architecture uses Per-Layer Embeddings (PLE) for efficiency and shared KV cache to reduce memory usage during long-context generation.
E2B (2.3B effective) and E4B (4.5B effective) are designed for edge devices, browsers, and mobile with native audio support. The 26B A4B is a Mixture-of-Experts model activating only 4B parameters per token, ideal for high-throughput scenarios. The 31B dense model is the flagship for maximum performance on reasoning, coding, and vision tasks.
All models process text and images with variable aspect ratios natively. The vision encoder uses learned 2D positions and can encode images to different token budgets (70-1120 tokens) for speed-quality tradeoffs. E2B and E4B include USM-style conformer audio encoders for speech understanding. Video is supported across the family by processing frames and audio tracks.
PLE gives each decoder layer its own small embedding for every token, creating a parallel conditioning pathway alongside the main residual stream. This allows each layer to receive token-specific information only when relevant, rather than packing everything into a single upfront embedding. It adds meaningful per-layer specialization at modest parameter cost, making small models more efficient.
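The idea can be sketched with a toy forward pass in which every layer looks up its own small embedding for the current token and adds it to the residual stream. This is a rough illustration under stated assumptions, not the Gemma 4 implementation: the sizes are invented, the up-projection from the small embedding to the hidden width is assumed, and attention and MLP blocks are omitted.

```python
import random

# Toy sizes, all hypothetical.
VOCAB, HIDDEN, PLE_DIM, LAYERS = 16, 8, 2, 3

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# One shared input embedding table, plus one small table per decoder layer.
token_embedding = rand_matrix(VOCAB, HIDDEN)
per_layer_embedding = [rand_matrix(VOCAB, PLE_DIM) for _ in range(LAYERS)]
# Assumption: each layer projects its small embedding up to the hidden width.
per_layer_projection = [rand_matrix(PLE_DIM, HIDDEN) for _ in range(LAYERS)]


def project(vec, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(v * matrix[i][j] for i, v in enumerate(vec)) for j in range(cols)]


def forward(token_id):
    """Run one token through the layers, showing only the PLE pathway."""
    hidden = list(token_embedding[token_id])
    for layer in range(LAYERS):
        ple = per_layer_embedding[layer][token_id]
        injected = project(ple, per_layer_projection[layer])
        hidden = [h + d for h, d in zip(hidden, injected)]
    return hidden


# Lookup-table cost of the per-layer pathway in this toy setup.
ple_table_params = LAYERS * VOCAB * PLE_DIM
```

In this sketch the per-layer tables cost `LAYERS * VOCAB * PLE_DIM` parameters, and they are only read by lookup, so they add no matrix multiplications over the vocabulary.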
FAQ
Getting started with Gemma 4 across different platforms, from cloud to edge devices.
Gemma 4 models are available on Kaggle and Hugging Face under Apache 2.0 license. You can use them through Google AI Studio, deploy on Vertex AI, or run locally with tools like Ollama, llama.cpp, MLX (for Apple Silicon), transformers, and Mistral.rs. ONNX checkpoints enable browser and edge device deployment.
Approximate VRAM for the weights alone: E2B needs ~9.6GB in BF16 or ~3.2GB at 4-bit; E4B needs ~15GB in BF16 or ~5GB at 4-bit; the 26B MoE needs ~48GB in BF16 or ~16GB at 4-bit; the 31B model needs ~58GB in BF16 or ~17GB at 4-bit. These figures cover base weights only; budget additional memory for the context window (KV cache) based on your use case.
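A quick way to apply these numbers is to add a KV cache estimate to the weight figure and compare the sum with available VRAM. The weight sizes below are copied from the paragraph above. The KV cache formula is the standard estimate for full-attention layers, and the layer count, KV head count, and head dimension it takes are inputs you would need to look up, since this page does not state them. It likely overestimates for Gemma 4, because sliding-window layers and the shared KV cache reduce the footprint.

```python
# Approximate weight sizes in GB, taken from the paragraph above.
WEIGHTS_GB = {
    "E2B": {"bf16": 9.6, "4bit": 3.2},
    "E4B": {"bf16": 15.0, "4bit": 5.0},
    "26B A4B": {"bf16": 48.0, "4bit": 16.0},
    "31B": {"bf16": 58.0, "4bit": 17.0},
}


def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Generic full-attention estimate: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9


def fits(model, precision, vram_gb, kv_gb=0.0):
    """True when weights plus the KV cache estimate fit in the given VRAM."""
    return WEIGHTS_GB[model][precision] + kv_gb <= vram_gb


def options_for(vram_gb, kv_gb=0.0):
    """List every (model, precision) pair that fits."""
    return [
        (model, precision)
        for model, sizes in WEIGHTS_GB.items()
        for precision in sizes
        if fits(model, precision, vram_gb, kv_gb)
    ]
```

For example, `options_for(24.0, kv_gb=2.0)` returns both edge models at either precision plus the two server models at 4-bit, which matches the intuition that a 24GB card runs the larger models only when quantized.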
Yes. The E2B and E4B models are specifically designed for browser and mobile deployment. transformers.js enables running Gemma 4 directly in browsers with WebGPU support. ONNX checkpoints work on various edge hardware backends. The models are optimized for Pixel devices and Chrome browser environments.
Gemma 4 has built-in function calling support without requiring fine-tuning. The models can parse tool definitions, generate structured JSON calls, and handle multimodal function calling (e.g., analyzing an image and calling a weather API). This powers autonomous agents for tasks like code execution, web browsing, and data retrieval.
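On the application side, function calling usually means describing a tool, parsing the structured call the model emits, and running the matching function. The sketch below shows that loop in plain Python. The `get_weather` tool, its canned result, and the `{"name": ..., "arguments": {...}}` call shape are assumptions for illustration; the exact format Gemma 4 emits is not specified on this page.

```python
import json

# A tool definition in JSON-Schema style (hypothetical tool).
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def get_weather(city: str) -> dict:
    """Stand-in implementation; a real one would call a weather API."""
    return {"city": city, "temperature_c": 21}


# Maps tool names to their schema and Python implementation.
REGISTRY = {"get_weather": (GET_WEATHER_TOOL, get_weather)}


def dispatch(model_output: str) -> dict:
    """Parse an assumed {"name": ..., "arguments": {...}} call and run it."""
    call = json.loads(model_output)
    schema, func = REGISTRY[call["name"]]
    args = call.get("arguments", {})
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return func(**args)


# Example string standing in for what a model might emit.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
```

Checking required arguments before the call keeps a malformed model output from reaching the tool itself.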
FAQ
How Gemma 4 compares to other models and what makes it competitive for different use cases.
The 31B model ranks #3 on Arena AI leaderboard among open models, ahead of Llama 3.3 70B despite being less than half the size. It achieves 89.2% on AIME 2026 math reasoning, 85.2% on MMLU Pro, and 80% on LiveCodeBench v6. The efficiency comes from architectural innovations like alternating attention patterns and shared KV cache.
The 26B A4B model has 26 billion total parameters but activates only 4 billion per token during generation. All 26B parameters must be loaded into memory for fast routing, but compute per token is closer to that of a 4B model. The model achieves 88.3% on AIME 2026 and 82.6% on MMLU Pro with significantly lower compute per token than the dense 31B model.
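The gap between total and active parameters comes from routing: a router scores the experts for each token and only the top few run. Below is a minimal top-k routing sketch. The expert count, the value of `k`, and the toy experts are hypothetical, since this page gives only the 26B total and 4B active figures.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Four toy experts that scale their input by different factors.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Share of parameters used per token, from the figures in the text.
active_fraction = 4 / 26
```

Every expert must stay in memory because any of them may be selected for the next token, which is why the memory requirement tracks the 26B total while per-token compute tracks the roughly 15% that is active.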
Yes. Small models support 128K context windows, while medium models handle 256K tokens. The architecture uses dual RoPE configurations (standard for sliding layers, pruned for global layers) to enable longer context. Shared KV cache reduces memory consumption during long-context generation, making it practical for processing entire codebases and research papers.
Gemma 4 is fully supported in TRL (Transformer Reinforcement Learning), with examples for multimodal tool responses and environment interaction. Hugging Face provides fine-tuning guides for Vertex AI using SFT. Unsloth Studio offers a UI-based fine-tuning experience. The models support PEFT methods like LoRA for parameter-efficient training.
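To see why LoRA is parameter-efficient, it helps to look at the arithmetic behind it: the frozen weight matrix W is adjusted by a low-rank product, W + (alpha / r) * B A, and only A and B are trained. The sketch below illustrates that math in plain Python with toy matrices. It does not use the PEFT or TRL APIs, and the layer sizes in the usage note are hypothetical rather than Gemma 4 dimensions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full weight parameters, trainable LoRA parameters) for one layer."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W, A, B, alpha, rank):
    """Compute W + (alpha / rank) * B @ A.

    Shapes: W is d_out x d_in, B is d_out x rank, A is rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]
```

For a hypothetical 4096 x 4096 projection at rank 16, `lora_param_counts(4096, 4096, 16)` gives 16,777,216 frozen parameters against 131,072 trainable ones, under 1% of the layer.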