Gemma 4: Frontier Multimodal Intelligence You Can Run Anywhere

Overview

Four Models, One Family: From Edge to Server-Grade Performance

Released April 2, 2026 under Apache 2.0, Gemma 4 delivers frontier-class multimodal intelligence across four architectures. From ultra-mobile 2B edge models to the flagship 31B dense variant, every size processes text, images with variable resolution, video, and audio natively.

Edge Models

Gemma 4 E2B & E4B: On-Device Intelligence

Ultra-compact models with 2.3B and 4.5B effective parameters, built for Pixel, Chrome, and browser deployment with native audio support and 128K context.

The E2B and E4B variants use Per-Layer Embeddings (PLE) to maximize parameter efficiency. They support text, image, video, and audio inputs natively, making them ideal for privacy-focused on-device applications.

Server Models

Gemma 4 31B Dense & 26B MoE: Frontier Performance

The 31B dense model ranks #3 on Arena AI leaderboard with 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while maintaining similar quality.

Both models feature 256K context windows, native function calling, and configurable thinking modes. The 31B achieves 85.2% on MMLU Pro and 80% on LiveCodeBench v6, competing with models many times its size.

Capabilities

Native Multimodal

All models process text, images with variable aspect ratios, video, and audio natively. E2B and E4B include audio encoders for speech understanding.

The vision encoder uses learned 2D positions and multidimensional RoPE, preserving original aspect ratios. Images can be encoded to different token budgets (70, 140, 280, 560, 1120) for optimal speed-quality tradeoffs.

All Models

Architecture

Extended Context Windows

Small models feature 128K context, while medium models support 256K. Dual RoPE configurations enable longer context processing.

Alternating local sliding-window (512-1024 tokens) and global full-context attention layers optimize memory usage. Shared KV cache reduces compute and memory for long-context generation.

128K-256K

Features

Configurable Thinking

All models support configurable thinking modes for advanced reasoning tasks, with native system prompt support for structured conversations.

The 31B model achieves 89.2% on AIME 2026 math reasoning and 84.3% on GPQA Diamond. Built-in function calling powers autonomous agents without fine-tuning.

All Models

Performance

Coding & Agentic Power

The 31B model scores 80% on LiveCodeBench v6 and reaches 2150 Codeforces ELO. The 26B MoE achieves 77.1% with only 4B active parameters.

Notable improvements in coding benchmarks alongside built-in function calling support enable highly capable autonomous agents. HLE benchmark shows 19.5% without tools, 26.5% with search.

Optimized

Multimodal

Vision & Document Analysis

The 31B model achieves 76.9% on MMMU Pro and 85.6% on MATH-Vision. OmniDocBench edit distance of 0.131 demonstrates strong OCR capabilities.

Variable aspect ratio support and configurable image token budgets enable efficient processing of documents, diagrams, and screenshots. The E4B model reaches 52.6% on MMMU Pro despite its compact size.

All Models

Integration

Deploy Anywhere

Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. ONNX checkpoints enable edge device deployment.

Apache 2.0 license permits responsible commercial use. Available on Kaggle, Hugging Face, and through Google AI Studio. Compatible with local tools like Ollama for private, offline interactions.

Open Source

Get Started

Start Chatting with Gemma 4 Today

Experience Google DeepMind's frontier multimodal models for free. No credit card required to start your first conversation.

Introduction

Watch: Gemma 4 Official Introduction

Learn about the four model architectures, native multimodal capabilities, and deployment options from Google DeepMind.

Performance

Frontier Performance Across Reasoning, Coding, and Vision

Gemma 4 models form a Pareto frontier, delivering exceptional performance relative to their size. The 31B dense model ranks #3 among all open models on Arena AI leaderboard.

Official benchmarks demonstrate competitive performance with models many times larger. The 31B model achieves 89.2% on AIME 2026 math reasoning, while the 26B MoE reaches similar quality with only 4B active parameters.

Gemma 4 performance comparison across model sizes and benchmarks

The 31B model achieves 89.2% on AIME 2026 and 85.2% on MMLU Pro, competing with models over 100B parameters.

Coding performance reaches 80% on LiveCodeBench v6 and 2150 Codeforces ELO, ahead of many larger models.

Vision capabilities include 76.9% on MMMU Pro and 85.6% on MATH-Vision, with strong OCR and document understanding.

Official Benchmarks

Gemma 4 Performance Across Key Tasks

Comprehensive evaluation across reasoning, coding, vision, audio, and long-context tasks demonstrates frontier-class capabilities.

Benchmark
Gemma 4 31B
Dense flagship
31B
Gemma 4 26B A4B
MoE (4B active)
26B
Gemma 4 E4B
Edge model
E4B
Gemma 4 E2B
Ultra-compact
E2B
MMLU Pro
Knowledge & reasoning
85.2%82.6%69.4%60.0%
AIME 2026 (no tools)
Math reasoning
89.2%88.3%42.5%37.5%
GPQA Diamond
Graduate-level science
84.3%82.3%58.6%43.4%
LiveCodeBench v6
Coding performance
80.0%77.1%52.0%44.0%
Codeforces ELO
Competitive programming
21501718940633
MMMU Pro
Multimodal understanding
76.9%73.8%52.6%44.2%
MATH-Vision
Visual math reasoning
85.6%82.4%59.5%52.4%
OmniDocBench 1.5
Document OCR (edit distance)
0.1310.1490.1810.290
Context Window
Maximum tokens
256K256K128K128K
Audio Support
Native audio input
NoNoYesYes

All figures from official Gemma 4 model card and Hugging Face blog. E2B and E4B benchmarks demonstrate exceptional efficiency for their parameter count.

Server Models

31B Dense & 26B MoE: Frontier Performance for Production

The 31B dense model ranks #3 on Arena AI leaderboard with 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while maintaining similar quality, ideal for high-throughput scenarios.

  • 31B Dense: 89.2% AIME 2026, 85.2% MMLU Pro, 80% LiveCodeBench v6, 2150 Codeforces ELO
  • 26B MoE (4B active): 88.3% AIME 2026, 82.6% MMLU Pro, 77.1% LiveCodeBench v6
  • 256K context windows with dual RoPE configurations for efficient long-context processing

Edge Models

E2B & E4B: On-Device Intelligence with Audio Support

Ultra-compact models with 2.3B and 4.5B effective parameters, designed for Pixel, Chrome, and browser deployment. Native audio encoders enable real-time speech understanding on-device.

  • E2B (2.3B effective, 5.1B with embeddings): 60% MMLU Pro, 44% LiveCodeBench, 128K context
  • E4B (4.5B effective, 8B with embeddings): 69.4% MMLU Pro, 52% LiveCodeBench, 128K context
  • Per-Layer Embeddings (PLE) maximize parameter efficiency for edge deployment

Architecture

Per-Layer Embeddings and Shared KV Cache

Gemma 4 introduces architectural innovations that maximize efficiency. PLE gives each decoder layer its own conditioning pathway, while shared KV cache reduces memory usage during long-context generation.

  • Per-Layer Embeddings add meaningful specialization at modest parameter cost
  • Shared KV cache: last N layers reuse key-value states, eliminating redundant projections
  • Alternating local sliding-window and global full-context attention for optimal memory usage
Gemma 4 architecture performance comparison

Multimodal

Native Image, Video, and Audio Understanding

All models process text and images with variable aspect ratios natively. Vision encoder uses learned 2D positions and can encode images to different token budgets (70-1120) for speed-quality tradeoffs.

  • Variable aspect ratio support preserves original image dimensions
  • Configurable image token budgets: 70, 140, 280, 560, 1120 tokens
  • E2B and E4B include USM-style conformer audio encoders for speech processing
Gemma 4 multimodal benchmark performance

Deployment

Deploy Anywhere: Browser, Local, or Cloud

Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. E2B and E4B run in browsers with transformers.js, while 31B and 26B excel on server hardware.

  • Browser: transformers.js enables E2B/E4B in Chrome with WebGPU acceleration
  • Local: Ollama, llama.cpp, MLX (Apple Silicon), Mistral.rs for private inference
  • Cloud: Google AI Studio, Vertex AI, or self-hosted with vLLM and TGI
Gemma 4 deployment options and performance

FAQ

Model Architecture and Capabilities

Understanding Gemma 4's technical innovations, from Per-Layer Embeddings to multimodal processing.

What makes Gemma 4 different from previous Gemma versions?

Gemma 4 introduces native multimodal support (text, image, video, audio), extended context windows (128K-256K), configurable thinking modes, and built-in function calling. The architecture uses Per-Layer Embeddings (PLE) for efficiency and shared KV cache to reduce memory usage during long-context generation.

What are the four Gemma 4 model sizes and when should I use each?

E2B (2.3B effective) and E4B (4.5B effective) are designed for edge devices, browsers, and mobile with native audio support. The 26B A4B is a Mixture-of-Experts model activating only 4B parameters per token, ideal for high-throughput scenarios. The 31B dense model is the flagship for maximum performance on reasoning, coding, and vision tasks.

How does Gemma 4 handle multimodal inputs?

All models process text and images with variable aspect ratios natively. The vision encoder uses learned 2D positions and can encode images to different token budgets (70-1120 tokens) for speed-quality tradeoffs. E2B and E4B include USM-style conformer audio encoders for speech understanding. Video is supported across the family by processing frames and audio tracks.

What is Per-Layer Embeddings (PLE) and why does it matter?

PLE gives each decoder layer its own small embedding for every token, creating a parallel conditioning pathway alongside the main residual stream. This allows each layer to receive token-specific information only when relevant, rather than packing everything into a single upfront embedding. It adds meaningful per-layer specialization at modest parameter cost, making small models more efficient.

FAQ

Deployment and Integration

Getting started with Gemma 4 across different platforms, from cloud to edge devices.

FAQ

Performance and Comparisons

How Gemma 4 compares to other models and what makes it competitive for different use cases.

How does Gemma 4 31B compare to larger models like Llama 3.3 70B?

The 31B model ranks #3 on Arena AI leaderboard among open models, ahead of Llama 3.3 70B despite being less than half the size. It achieves 89.2% on AIME 2026 math reasoning, 85.2% on MMLU Pro, and 80% on LiveCodeBench v6. The efficiency comes from architectural innovations like alternating attention patterns and shared KV cache.

What is the Mixture-of-Experts (MoE) architecture in the 26B model?

The 26B A4B model has 26 billion total parameters but activates only 4 billion per token during generation. All 26B parameters must be loaded into memory for fast routing, but inference cost is closer to a 4B model. This achieves 88.3% on AIME 2026 and 82.6% on MMLU Pro with significantly lower compute per token than the dense 31B model.

Can Gemma 4 handle long documents and extended context?

Yes. Small models support 128K context windows, while medium models handle 256K tokens. The architecture uses dual RoPE configurations (standard for sliding layers, pruned for global layers) to enable longer context. Shared KV cache reduces memory consumption during long-context generation, making it practical for processing entire codebases and research papers.

Where can I find fine-tuning examples and training resources?

Gemma 4 is fully supported in TRL (Transformer Reinforcement Learning), with examples for multimodal tool responses and environment interaction. Hugging Face provides fine-tuning guides for Vertex AI using SFT. Unsloth Studio offers a UI-based fine-tuning experience. The models support PEFT methods like LoRA for parameter-efficient training.