Gemma 4: Frontier Multimodal Intelligence You Can Run Anywhere

Overview

Four Models, One Family: From Edge to Server-Grade Performance

Released April 2, 2026 under Apache 2.0, Gemma 4 delivers frontier-class multimodal intelligence across four architectures. From the ultra-compact E2B edge model to the flagship 31B dense variant, every size processes text, variable-resolution images, and video natively; the E2B and E4B edge models also accept audio input.

Edge Models

Gemma 4 E2B & E4B: On-Device Intelligence

Ultra-compact models with 2.3B and 4.5B effective parameters, built for Pixel, Chrome, and browser deployment with native audio support and 128K context.

The E2B and E4B variants use Per-Layer Embeddings (PLE) to maximize parameter efficiency. They support text, image, video, and audio inputs natively, making them ideal for privacy-focused on-device applications.

Server Models

Gemma 4 31B Dense & 26B MoE: Frontier Performance

The 31B dense model ranks #3 among open models on the Arena AI leaderboard and scores 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while staying close in quality (88.3% on AIME 2026).

Both models feature 256K context windows, native function calling, and configurable thinking modes. The 31B achieves 85.2% on MMLU Pro and 80% on LiveCodeBench v6, competing with models many times its size.

Capabilities

Native Multimodal

All models process text, images with variable aspect ratios, and video natively. E2B and E4B add audio encoders for speech understanding.

The vision encoder uses learned 2D positions and multidimensional RoPE, preserving original aspect ratios. Images can be encoded to different token budgets (70, 140, 280, 560, 1120) for optimal speed-quality tradeoffs.
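
As a rough illustration of the speed-quality tradeoff, the sketch below picks the largest listed token budget that fits a per-image limit. The helper names `pick_image_budget` and `images_that_fit` are hypothetical and not part of any Gemma API; only the five budget values come from the text above.

```python
from bisect import bisect_right

# Token budgets listed in the text above.
IMAGE_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)


def pick_image_budget(max_tokens_per_image: int) -> int:
    """Return the largest listed budget that fits within max_tokens_per_image."""
    idx = bisect_right(IMAGE_TOKEN_BUDGETS, max_tokens_per_image)
    if idx == 0:
        raise ValueError("no listed budget fits within the requested limit")
    return IMAGE_TOKEN_BUDGETS[idx - 1]


def images_that_fit(context_tokens: int, budget: int) -> int:
    """How many images at a given budget fit in a token allowance (text tokens ignored)."""
    return context_tokens // budget
```

A lower budget lets more images share the same context at reduced detail; for example, an allowance of 8,000 tokens holds 28 images at the 280-token budget.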

Architecture

Extended Context Windows

The edge models (E2B and E4B) offer 128K context, while the server models (26B MoE and 31B dense) support 256K. Dual RoPE configurations enable longer context processing.

Alternating local sliding-window (512-1024 tokens) and global full-context attention layers optimize memory usage. Shared KV cache reduces compute and memory for long-context generation.
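
The difference between the two layer types can be sketched with plain attention masks. This is a minimal illustration, not Gemma's implementation: the `global_every` schedule is a hypothetical parameter, since the text does not state how local and global layers are interleaved.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Global layer: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Local layer: token i attends only to the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def layer_kinds(num_layers: int, global_every: int) -> list[str]:
    """Hypothetical schedule: every `global_every`-th layer is global, the rest local."""
    return ["global" if (i + 1) % global_every == 0 else "local" for i in range(num_layers)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask allows; a proxy for compute and cache use."""
    return sum(sum(row) for row in mask)
```

For a local layer the number of allowed pairs grows linearly with sequence length once the window is full, while for a global layer it grows quadratically, which is why mixing the two reduces memory pressure at long context.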

Features

Configurable Thinking

All models support configurable thinking modes for advanced reasoning tasks, with native system prompt support for structured conversations.

The 31B model achieves 89.2% on AIME 2026 math reasoning and 84.3% on GPQA Diamond. Built-in function calling powers autonomous agents without fine-tuning.

Performance

Coding & Agentic Power

The 31B model scores 80% on LiveCodeBench v6 and reaches a Codeforces ELO of 2150. The 26B MoE achieves 77.1% on LiveCodeBench v6 with only 4B active parameters.

Stronger coding results combined with built-in function calling make the models well suited to autonomous agents. On the HLE benchmark, the reported scores are 19.5% without tools and 26.5% with search.

Multimodal

Vision & Document Analysis

The 31B model achieves 76.9% on MMMU Pro and 85.6% on MATH-Vision. OmniDocBench edit distance of 0.131 demonstrates strong OCR capabilities.

Variable aspect ratio support and configurable image token budgets enable efficient processing of documents, diagrams, and screenshots. The E4B model reaches 52.6% on MMMU Pro despite its compact size.

Integration

Deploy Anywhere

Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. ONNX checkpoints enable edge device deployment.

The Apache 2.0 license permits commercial use. The models are available on Kaggle and Hugging Face and through Google AI Studio, and they work with local tools such as Ollama for private, offline use.

Get Started

Start Chatting with Gemma 4 Today

Experience Google DeepMind's frontier multimodal models for free. No credit card required to start your first conversation.

Introduction

Watch: Gemma 4 Official Introduction

Learn about the four model architectures, native multimodal capabilities, and deployment options from Google DeepMind.

Performance

Frontier Performance Across Reasoning, Coding, and Vision

Gemma 4 models sit on the Pareto frontier of quality versus size, delivering strong performance relative to their parameter counts. The 31B dense model ranks #3 among all open models on the Arena AI leaderboard.

Official benchmarks demonstrate competitive performance with models many times larger. The 31B model achieves 89.2% on AIME 2026 math reasoning, while the 26B MoE reaches similar quality with only 4B active parameters.

Figure: Gemma 4 performance comparison across model sizes and benchmarks

The 31B model achieves 89.2% on AIME 2026 and 85.2% on MMLU Pro, competing with models over 100B parameters.

Coding performance reaches 80% on LiveCodeBench v6 and 2150 Codeforces ELO, ahead of many larger models.

Vision capabilities include 76.9% on MMMU Pro and 85.6% on MATH-Vision, with strong OCR and document understanding.

Official Benchmarks

Gemma 4 Performance Across Key Tasks

Comprehensive evaluation across reasoning, coding, vision, audio, and long-context tasks demonstrates frontier-class capabilities.

Benchmark	31B Dense	26B A4B MoE (4B active)	E4B	E2B
MMLU Pro (knowledge and reasoning)	85.2%	82.6%	69.4%	60.0%
AIME 2026, no tools (math reasoning)	89.2%	88.3%	42.5%	37.5%
GPQA Diamond (graduate-level science)	84.3%	82.3%	58.6%	43.4%
LiveCodeBench v6 (coding)	80.0%	77.1%	52.0%	44.0%
Codeforces ELO (competitive programming)	2150	1718	940	633
MMMU Pro (multimodal understanding)	76.9%	73.8%	52.6%	44.2%
MATH-Vision (visual math reasoning)	85.6%	82.4%	59.5%	52.4%
OmniDocBench 1.5 (document OCR, edit distance, lower is better)	0.131	0.149	0.181	0.290
Context window (maximum tokens)	256K	256K	128K	128K
Native audio input	No	No	Yes	Yes

All figures from official Gemma 4 model card and Hugging Face blog. E2B and E4B benchmarks demonstrate exceptional efficiency for their parameter count.
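
The OmniDocBench row reports an edit distance rather than an accuracy, so lower values are better. A minimal sketch of a normalized character-level edit distance is shown below; dividing by the longer string's length is an assumption made for illustration, and the benchmark's exact normalization may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length; 0.0 means a perfect match."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0
```

Under this definition, an OCR output that confuses one character in a twelve-character string scores 1/12, or about 0.083.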

Server Models

31B Dense & 26B MoE: Frontier Performance for Production

The 31B dense model ranks #3 among open models on the Arena AI leaderboard and scores 89.2% on AIME 2026. The 26B MoE activates only 4B parameters per token while staying close in quality, which makes it a good fit for high-throughput scenarios.

31B Dense: 89.2% AIME 2026, 85.2% MMLU Pro, 80% LiveCodeBench v6, 2150 Codeforces ELO
26B MoE (4B active): 88.3% AIME 2026, 82.6% MMLU Pro, 77.1% LiveCodeBench v6
256K context windows with dual RoPE configurations for efficient long-context processing

Edge Models

E2B & E4B: On-Device Intelligence with Audio Support

Ultra-compact models with 2.3B and 4.5B effective parameters, designed for Pixel, Chrome, and browser deployment. Native audio encoders enable real-time speech understanding on-device.

E2B (2.3B effective, 5.1B with embeddings): 60% MMLU Pro, 44% LiveCodeBench, 128K context
E4B (4.5B effective, 8B with embeddings): 69.4% MMLU Pro, 52% LiveCodeBench, 128K context
Per-Layer Embeddings (PLE) maximize parameter efficiency for edge deployment

Architecture

Per-Layer Embeddings and Shared KV Cache

Gemma 4 introduces architectural innovations that maximize efficiency. PLE gives each decoder layer its own conditioning pathway, while shared KV cache reduces memory usage during long-context generation.

Per-Layer Embeddings add meaningful specialization at modest parameter cost
Shared KV cache: last N layers reuse key-value states, eliminating redundant projections
Alternating local sliding-window and global full-context attention for optimal memory usage
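
The effect of sharing key-value states across the last N layers can be sketched as follows. This is an illustration under stated assumptions: the rule that shared layers reuse the states of the final layer that still computes its own, and all the sizes used, are hypothetical rather than taken from the Gemma 4 configuration.

```python
def kv_source_layers(num_layers: int, num_shared: int) -> list[int]:
    """For each layer, the index of the layer whose K/V states it reads.

    Assumption: the last `num_shared` layers all reuse the K/V states of the
    final layer that still computes its own.
    """
    if not 0 <= num_shared < num_layers:
        raise ValueError("num_shared must be in [0, num_layers)")
    last_own = num_layers - num_shared - 1
    return [min(layer, last_own) for layer in range(num_layers)]


def kv_cache_bytes(num_layers: int, num_shared: int, seq_len: int,
                   num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Cache size: two tensors (K and V) for each layer that stores its own states."""
    layers_storing = len(set(kv_source_layers(num_layers, num_shared)))
    return 2 * layers_storing * seq_len * num_kv_heads * head_dim * bytes_per_value
```

With six layers of which two are shared, only four layers store a cache, so the cache shrinks to two thirds of its unshared size and the shared layers skip their own key and value projections.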

Figure: Gemma 4 architecture performance comparison

Multimodal

Native Image, Video, and Audio Understanding

All models process text and images with variable aspect ratios natively. Vision encoder uses learned 2D positions and can encode images to different token budgets (70-1120) for speed-quality tradeoffs.

Variable aspect ratio support preserves original image dimensions
Configurable image token budgets: 70, 140, 280, 560, 1120 tokens
E2B and E4B include USM-style conformer audio encoders for speech processing

Figure: Gemma 4 multimodal benchmark performance

Deployment

Deploy Anywhere: Browser, Local, or Cloud

Day-0 support for transformers, llama.cpp, MLX, WebGPU, Mistral.rs, and more. E2B and E4B run in browsers with transformers.js, while 31B and 26B excel on server hardware.

Browser: transformers.js enables E2B/E4B in Chrome with WebGPU acceleration
Local: Ollama, llama.cpp, MLX (Apple Silicon), Mistral.rs for private inference
Cloud: Google AI Studio, Vertex AI, or self-hosted with vLLM and TGI

Figure: Gemma 4 deployment options and performance

FAQ

Model Architecture and Capabilities

Understanding Gemma 4's technical innovations, from Per-Layer Embeddings to multimodal processing.

What makes Gemma 4 different from previous Gemma versions?

Gemma 4 introduces native multimodal support (text, image, video, audio), extended context windows (128K-256K), configurable thinking modes, and built-in function calling. The architecture uses Per-Layer Embeddings (PLE) for efficiency and shared KV cache to reduce memory usage during long-context generation.

What are the four Gemma 4 model sizes and when should I use each?

E2B (2.3B effective) and E4B (4.5B effective) are designed for edge devices, browsers, and mobile with native audio support. The 26B A4B is a Mixture-of-Experts model activating only 4B parameters per token, ideal for high-throughput scenarios. The 31B dense model is the flagship for maximum performance on reasoning, coding, and vision tasks.

How does Gemma 4 handle multimodal inputs?

All models process text and images with variable aspect ratios natively. The vision encoder uses learned 2D positions and can encode images to different token budgets (70-1120 tokens) for speed-quality tradeoffs. E2B and E4B include USM-style conformer audio encoders for speech understanding. Video is supported across the family by processing frames, with audio tracks handled by the models that include an audio encoder.

What are Per-Layer Embeddings (PLE) and why do they matter?

PLE gives each decoder layer its own small embedding for every token, creating a parallel conditioning pathway alongside the main residual stream. This allows each layer to receive token-specific information only when relevant, rather than packing everything into a single upfront embedding. It adds meaningful per-layer specialization at modest parameter cost, making small models more efficient.
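
The idea can be sketched with a toy model in plain Python. Everything here is an illustrative assumption (the class name, the sizes, and the choice to project the small embedding and add it to the residual stream); it is not the Gemma 4 implementation.

```python
import random


def make_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyPerLayerEmbeddings:
    """Toy model of the idea only; sizes and wiring are illustrative assumptions."""

    def __init__(self, vocab_size: int, hidden_dim: int, ple_dim: int,
                 num_layers: int, seed: int = 0):
        rng = random.Random(seed)
        # One small embedding table per layer: vocab_size x ple_dim.
        self.tables = [make_matrix(vocab_size, ple_dim, rng) for _ in range(num_layers)]
        # One projection per layer: ple_dim x hidden_dim.
        self.projections = [make_matrix(ple_dim, hidden_dim, rng) for _ in range(num_layers)]

    def layer_signal(self, layer: int, token_id: int) -> list[float]:
        """Project this layer's small embedding of the token up to hidden_dim."""
        small = self.tables[layer][token_id]
        proj = self.projections[layer]
        return [sum(s * proj[k][d] for k, s in enumerate(small)) for d in range(len(proj[0]))]

    def apply(self, layer: int, token_id: int, hidden: list[float]) -> list[float]:
        """Add the per-layer signal to the residual stream for one token."""
        return [h + x for h, x in zip(hidden, self.layer_signal(layer, token_id))]

    def extra_parameters(self) -> int:
        """Parameters added by the per-layer tables and projections."""
        return sum(len(t) * len(t[0]) for t in self.tables) + sum(
            len(p) * len(p[0]) for p in self.projections
        )
```

Because each layer looks up its own vector for the same token, two layers can receive different token-specific signals, and the lookup itself is an index operation rather than a matrix multiply.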

FAQ

Deployment and Integration

Getting started with Gemma 4 across different platforms, from cloud to edge devices.

Where can I download and run Gemma 4 models?

Gemma 4 models are available on Kaggle and Hugging Face under Apache 2.0 license. You can use them through Google AI Studio, deploy on Vertex AI, or run locally with tools like Ollama, llama.cpp, MLX (for Apple Silicon), transformers, and Mistral.rs. ONNX checkpoints enable browser and edge device deployment.

What are the hardware requirements for running Gemma 4?

E2B needs roughly 9.6GB of VRAM at BF16, or 3.2GB at 4-bit. E4B needs about 15GB (BF16) or 5GB (4-bit). The 31B model needs about 58GB (BF16) or 17GB (4-bit), and the 26B MoE about 48GB (BF16) or 16GB (4-bit). These figures cover the model weights only; budget additional memory for the KV cache, which grows with the context length you use.
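
A back-of-the-envelope check of the BF16 figures is parameters times bytes per parameter. The sketch below uses the parameter counts quoted elsewhere on this page and assumes the quoted sizes are binary gigabytes (GiB); the helper name `weight_gib` is hypothetical.

```python
GIB = 2**30

# Parameter counts quoted on this page (E2B and E4B include embeddings).
PARAMS = {"E2B": 5.1e9, "E4B": 8e9, "26B A4B": 26e9, "31B": 31e9}


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


for name, n in PARAMS.items():
    print(f"{name}: ~{weight_gib(n, 16):.1f} GiB at BF16")
```

This gives roughly 9.5, 14.9, 48.4, and 57.7 GiB, close to the BF16 numbers above. The quoted 4-bit figures are higher than a naive 4 bits per parameter would suggest, which would be consistent with some tensors staying at higher precision, though that reading is an assumption.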

Can I run Gemma 4 in the browser or on mobile devices?

Yes. The E2B and E4B models are specifically designed for browser and mobile deployment. transformers.js enables running Gemma 4 directly in browsers with WebGPU support. ONNX checkpoints work on various edge hardware backends. The models are optimized for Pixel devices and Chrome browser environments.

How do I use Gemma 4 with function calling and agents?

Gemma 4 has built-in function calling support without requiring fine-tuning. The models can parse tool definitions, generate structured JSON calls, and handle multimodal function calling (e.g., analyzing an image and calling a weather API). This powers autonomous agents for tasks like code execution, web browsing, and data retrieval.
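
On the application side, handling a structured call usually means parsing the JSON and dispatching to a registered function. The sketch below assumes a call shaped like `{"name": ..., "arguments": {...}}`; that shape, the `get_weather` stand-in, and the `TOOLS` registry are hypothetical, and the actual format Gemma 4 emits should be taken from its chat template.

```python
import json


def get_weather(city: str) -> dict:
    """Stand-in tool; a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}


# Registry of callable tools, keyed by the name the model is told about.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> dict:
    """Parse a JSON tool call of the assumed shape and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


result = dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
```

Rejecting unknown tool names before calling anything keeps the agent loop from executing functions the application never registered.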

FAQ

Performance and Comparisons

How Gemma 4 compares to other models and what makes it competitive for different use cases.

How does Gemma 4 31B compare to larger models like Llama 3.3 70B?

The 31B model ranks #3 on Arena AI leaderboard among open models, ahead of Llama 3.3 70B despite being less than half the size. It achieves 89.2% on AIME 2026 math reasoning, 85.2% on MMLU Pro, and 80% on LiveCodeBench v6. The efficiency comes from architectural innovations like alternating attention patterns and shared KV cache.

What is the Mixture-of-Experts (MoE) architecture in the 26B model?

The 26B A4B model has 26 billion total parameters but activates only 4 billion per token during generation. All 26B parameters must be loaded into memory so that any expert can be selected without delay, but per-token compute is closer to that of a 4B model. The model achieves 88.3% on AIME 2026 and 82.6% on MMLU Pro with significantly lower compute per token than the dense 31B model.
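
A generic top-k router illustrates why only part of the model runs for each token. This is a textbook sketch, not Gemma 4's router: the number of experts, the value of k, and the renormalization of the selected weights are all assumptions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: float, experts: list, router_logits: list[float], k: int) -> float:
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))
```

Only the selected experts are evaluated, so compute scales with the active parameters (here 4B of 26B, about 15%), while memory still has to hold every expert.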

Can Gemma 4 handle long documents and extended context?

Yes. The E2B and E4B models support 128K context windows, while the 26B and 31B models handle 256K tokens. The architecture uses dual RoPE configurations (standard for sliding layers, pruned for global layers) to enable longer context. Shared KV cache reduces memory consumption during long-context generation, making it practical to process entire codebases and research papers.
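
For readers unfamiliar with RoPE, the standard form rotates pairs of vector components by an angle proportional to the token position, so attention scores depend only on relative distance. The sketch below shows that standard form only; the `base` value is an assumption, and the pruned variant used for global layers is not reproduced here.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Shifting both the query and the key by the same number of positions leaves their dot product unchanged, which is the property that lets a model generalize across absolute positions in a long context.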

Where can I find fine-tuning examples and training resources?

Gemma 4 is fully supported in TRL (Transformer Reinforcement Learning), with examples for multimodal tool responses and environment interaction. Hugging Face provides fine-tuning guides for Vertex AI using SFT. Unsloth Studio offers a UI-based fine-tuning experience. The models support PEFT methods like LoRA for parameter-efficient training.
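
To see why LoRA is called parameter-efficient, compare the size of the two low-rank matrices it trains with the frozen weight they sit beside. The dimensions in the usage note are hypothetical and do not describe any Gemma 4 layer.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the original weight's parameter count that LoRA actually trains."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)
```

For a 4096 by 4096 projection at rank 16, LoRA trains 131,072 parameters, under 1% of the 16.8 million in the frozen weight.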
