Gemma 4 Local

Run Gemma 4 on your own hardware - private, offline, no API keys

Every Gemma 4 model runs locally. From the 3.2GB E2B on a phone to the 31B flagship on a workstation. Ollama, llama.cpp, MLX, transformers, and browser deployment - pick your tool and start in minutes.

Try Online First See hardware requirements

Hardware requirements

What you need to run each model locally

Memory requirements depend on model size and quantization level. 4-bit quantization offers the best balance of quality and memory usage for most local deployments.

Hardware Guide

Match your hardware to the right model

E2B runs on phones and budget laptops. E4B fits comfortably on most laptops. The 26B MoE needs a decent GPU. The 31B Dense needs a workstation-class setup.

All memory figures are for model weights only. Add 2-4GB for context window (KV cache) depending on your use case.

Try Online First Download models

Phone / Budget laptop

E2B (3.2-10GB)

4-bit: ~3.2GB | 8-bit: ~5-8GB | BF16: ~10GB. Runs on phones, Raspberry Pi, and budget hardware.

~95 tok/s on consumer GPUs. The fastest model in the family. Ideal for real-time applications.

Easiest to run

Download E2B Setup guide

Laptop / Desktop

E4B (5.5-16GB)

4-bit: ~5.5-6GB | 8-bit: ~9-12GB | BF16: ~16GB. Best edge model for everyday local use.

Good speed on RTX 3060+ or M1+ Macs. The recommended starting point for most local users.

Recommended

Download E4B Setup guide

GPU workstation

26B MoE (16-48GB)

4-bit: ~16GB | 8-bit: ~24GB | BF16: ~48GB. Near-31B quality on a single RTX 4090 or M4 Pro.

~2-8 tok/s depending on hardware. Best for batch processing and quality-critical local tasks.

Power users

Download 26B Setup guide

Multi-GPU / Server

31B Dense (17-58GB)

4-bit: ~17GB | 8-bit: ~29GB | BF16: ~58GB. Maximum quality for local deployment.

Requires RTX 4090+ or M4 Max+ for comfortable use. Best for maximum quality without cloud dependency.

Maximum quality

Download 31B Setup guide

Deployment tools

Six ways to run Gemma 4 locally

From one-command Ollama setup to custom llama.cpp builds, there's a local deployment path for every skill level.

Ollama

One command to install, one command to run. The easiest path to local Gemma 4. HTTP API included for integration with other tools.

llama.cpp

Maximum control over quantization, context size, and GPU layers. Best for power users who want to tune every parameter.

MLX (Apple Silicon)

Optimized for M1/M2/M3/M4 Macs. Leverages unified memory for efficient inference on Apple hardware.

transformers (Python)

Full Hugging Face ecosystem integration. Best for Python developers who want to script, fine-tune, or build custom pipelines.

transformers.js (Browser)

Run E2B and E4B directly in Chrome with WebGPU. No installation, no server - just open a webpage.

LM Studio

GUI-based local model management. Download, configure, and chat with Gemma 4 through a desktop application.

Quick start

Get running in 2 minutes with Ollama

The fastest path from zero to local Gemma 4. Install Ollama, pull a model, start chatting.

Install & run

Install: curl -fsSL https://ollama.com/install.sh | sh
Run E4B: ollama run gemma4:e4b
Run 26B: ollama run gemma4:26b
Run 31B: ollama run gemma4:31b
API: curl http://localhost:11434/api/generate -d '{...}'

Tips

Start with E4B if you have 8-16GB RAM
Use 4-bit quantization (Q4_K_M) for best quality/memory ratio
Add --num-gpu-layers for GPU acceleration in llama.cpp
Set context size based on your available memory
Monitor VRAM usage - leave headroom for KV cache

Try Online First Download models

Local performance

Real-world speed and quality on consumer hardware

Actual performance varies by hardware, quantization, and context length. Here's what to expect on common setups.

Local inference speed depends on your GPU, RAM, quantization level, and context length. These figures represent typical performance on common consumer hardware.

Try Online First Hardware guide

Gemma 4 local performance across different hardware configurations

E2B at 4-bit: ~95 tok/s on RTX 3060, ~60 tok/s on M1 MacBook

E4B at 4-bit: ~40-60 tok/s on RTX 3060, ~30 tok/s on M1 MacBook

26B at 4-bit: ~8-15 tok/s on RTX 4090, ~5 tok/s on M4 Pro

31B at 4-bit: ~5-10 tok/s on RTX 4090, ~3 tok/s on M4 Max

Hardware requirements

VRAM and RAM requirements by quantization

Choose your quantization level based on available memory. 4-bit (Q4_K_M) offers the best quality-to-memory ratio for most users.

Benchmark	E2B E2B	E4B E4B	26B MoE 26B	31B Dense 31B
4-bit (Q4_K_M) Recommended	~3.2 GB	~5.5 GB	~16 GB	~17 GB
8-bit (Q8_0) Higher quality	~5-8 GB	~9-12 GB	~24 GB	~29 GB
BF16 / FP16 Full precision	~10 GB	~16 GB	~48 GB	~58 GB
Min GPU Comfortable use	Any 4GB+	RTX 3060+	RTX 4090	2x RTX 4090
Apple Silicon Recommended Mac	Any M1+	M1+ 16GB	M4 Pro 24GB	M4 Max 64GB

Memory figures are for model weights only. Add 2-4GB for KV cache depending on context length.

Privacy First

Your data never leaves your device

Running Gemma 4 locally means complete privacy. No API calls, no data logging, no internet required after download. Process sensitive documents, code, and conversations with zero exposure.

Zero data transmission - everything stays on your hardware
No API keys, no accounts, no usage tracking
Process confidential documents and proprietary code safely

Download now Privacy guide

Browser AI

Run Gemma 4 in your browser - no installation needed

The E2B and E4B models run directly in Chrome with WebGPU via transformers.js. No server, no installation, no configuration. Just open a webpage and start chatting.

transformers.js enables in-browser inference with WebGPU
E2B and E4B optimized for browser deployment
Works in Chrome, Edge, and other WebGPU-capable browsers

Try in browser transformers.js docs

Run Gemma 4 in your browser - no installation needed

Developer Tools

Integrate local Gemma 4 into your workflow

Use Gemma 4 as a local coding assistant with Claude Code, VS Code, or any tool that supports OpenAI-compatible APIs. Ollama and llama.cpp both expose compatible endpoints.