Gemma 4 Local

Run Gemma 4 on your own hardware - private, offline, no API keys

Every Gemma 4 model runs locally. From the 3.2GB E2B on a phone to the 31B flagship on a workstation. Ollama, llama.cpp, MLX, transformers, and browser deployment - pick your tool and start in minutes.

Hardware requirements

What you need to run each model locally

Memory requirements depend on model size and quantization level. 4-bit quantization offers the best balance of quality and memory usage for most local deployments.

Hardware Guide

Match your hardware to the right model

E2B runs on phones and budget laptops. E4B fits comfortably on most laptops. The 26B MoE needs a decent GPU. The 31B Dense needs a workstation-class setup.

All memory figures are for model weights only. Add 2-4GB for context window (KV cache) depending on your use case.

Phone / Budget laptop

E2B (3.2-10GB)

4-bit: ~3.2GB | 8-bit: ~5-8GB | BF16: ~10GB. Runs on phones, Raspberry Pi, and budget hardware.

~95 tok/s on consumer GPUs. The fastest model in the family. Ideal for real-time applications.

Easiest to run

Laptop / Desktop

E4B (5.5-16GB)

4-bit: ~5.5-6GB | 8-bit: ~9-12GB | BF16: ~16GB. Best edge model for everyday local use.

Good speed on RTX 3060+ or M1+ Macs. The recommended starting point for most local users.

Recommended

GPU workstation

26B MoE (16-48GB)

4-bit: ~16GB | 8-bit: ~24GB | BF16: ~48GB. Near-31B quality on a single RTX 4090 or M4 Pro.

~2-8 tok/s depending on hardware. Best for batch processing and quality-critical local tasks.

Power users

Multi-GPU / Server

31B Dense (17-58GB)

4-bit: ~17GB | 8-bit: ~29GB | BF16: ~58GB. Maximum quality for local deployment.

Requires RTX 4090+ or M4 Max+ for comfortable use. Best for maximum quality without cloud dependency.

Maximum quality

Deployment tools

Six ways to run Gemma 4 locally

From one-command Ollama setup to custom llama.cpp builds, there's a local deployment path for every skill level.

Ollama

One command to install, one command to run. The easiest path to local Gemma 4. HTTP API included for integration with other tools.

llama.cpp

Maximum control over quantization, context size, and GPU layers. Best for power users who want to tune every parameter.

MLX (Apple Silicon)

Optimized for M1/M2/M3/M4 Macs. Leverages unified memory for efficient inference on Apple hardware.

transformers (Python)

Full Hugging Face ecosystem integration. Best for Python developers who want to script, fine-tune, or build custom pipelines.

transformers.js (Browser)

Run E2B and E4B directly in Chrome with WebGPU. No installation, no server - just open a webpage.

LM Studio

GUI-based local model management. Download, configure, and chat with Gemma 4 through a desktop application.

Quick start

Get running in 2 minutes with Ollama

The fastest path from zero to local Gemma 4. Install Ollama, pull a model, start chatting.

Install & run

  • Install: curl -fsSL https://ollama.com/install.sh | sh
  • Run E4B: ollama run gemma4:e4b
  • Run 26B: ollama run gemma4:26b
  • Run 31B: ollama run gemma4:31b
  • API: curl http://localhost:11434/api/generate -d '{...}'

Tips

  • Start with E4B if you have 8-16GB RAM
  • Use 4-bit quantization (Q4_K_M) for best quality/memory ratio
  • Add --num-gpu-layers for GPU acceleration in llama.cpp
  • Set context size based on your available memory
  • Monitor VRAM usage - leave headroom for KV cache

Local performance

Real-world speed and quality on consumer hardware

Actual performance varies by hardware, quantization, and context length. Here's what to expect on common setups.

Local inference speed depends on your GPU, RAM, quantization level, and context length. These figures represent typical performance on common consumer hardware.

Gemma 4 local performance across different hardware configurations

E2B at 4-bit: ~95 tok/s on RTX 3060, ~60 tok/s on M1 MacBook

E4B at 4-bit: ~40-60 tok/s on RTX 3060, ~30 tok/s on M1 MacBook

26B at 4-bit: ~8-15 tok/s on RTX 4090, ~5 tok/s on M4 Pro

31B at 4-bit: ~5-10 tok/s on RTX 4090, ~3 tok/s on M4 Max

Hardware requirements

VRAM and RAM requirements by quantization

Choose your quantization level based on available memory. 4-bit (Q4_K_M) offers the best quality-to-memory ratio for most users.

Benchmark
E2B
E2B
E4B
E4B
26B MoE
26B
31B Dense
31B
4-bit (Q4_K_M)
Recommended
~3.2 GB~5.5 GB~16 GB~17 GB
8-bit (Q8_0)
Higher quality
~5-8 GB~9-12 GB~24 GB~29 GB
BF16 / FP16
Full precision
~10 GB~16 GB~48 GB~58 GB
Min GPU
Comfortable use
Any 4GB+RTX 3060+RTX 40902x RTX 4090
Apple Silicon
Recommended Mac
Any M1+M1+ 16GBM4 Pro 24GBM4 Max 64GB

Memory figures are for model weights only. Add 2-4GB for KV cache depending on context length.

Privacy First

Your data never leaves your device

Running Gemma 4 locally means complete privacy. No API calls, no data logging, no internet required after download. Process sensitive documents, code, and conversations with zero exposure.

  • Zero data transmission - everything stays on your hardware
  • No API keys, no accounts, no usage tracking
  • Process confidential documents and proprietary code safely
Your data never leaves your device

Browser AI

Run Gemma 4 in your browser - no installation needed

The E2B and E4B models run directly in Chrome with WebGPU via transformers.js. No server, no installation, no configuration. Just open a webpage and start chatting.

  • transformers.js enables in-browser inference with WebGPU
  • E2B and E4B optimized for browser deployment
  • Works in Chrome, Edge, and other WebGPU-capable browsers
Run Gemma 4 in your browser - no installation needed

Developer Tools

Integrate local Gemma 4 into your workflow

Use Gemma 4 as a local coding assistant with Claude Code, VS Code, or any tool that supports OpenAI-compatible APIs. Ollama and llama.cpp both expose compatible endpoints.

  • OpenAI-compatible API via Ollama (localhost:11434)
  • Works with Claude Code, Continue, Cursor, and other AI tools
  • Fine-tune with TRL, Unsloth, or Keras for custom tasks
Integrate local Gemma 4 into your workflow

Local AI ecosystem

Tools and platforms for local Gemma 4

A growing ecosystem of tools makes running Gemma 4 locally easier than ever.

Ollama

Easiest local deployment with HTTP API

Get started

llama.cpp

Maximum control and customization

Learn more

LM Studio

Desktop GUI for local model management

Download

transformers.js

Browser-based inference with WebGPU

Try it

MLX

Apple Silicon optimized inference

Get started

vLLM

High-throughput local serving

Deploy

Get started

Run Gemma 4 on your hardware today

Try it online first, then download for private, offline use. No API keys, no accounts, no data leaves your device.