Gemma 4 Local
Run Gemma 4 on your own hardware - private, offline, no API keys
Every Gemma 4 model runs locally. From the 3.2GB E2B on a phone to the 31B flagship on a workstation. Ollama, llama.cpp, MLX, transformers, and browser deployment - pick your tool and start in minutes.
Hardware requirements
What you need to run each model locally
Memory requirements depend on model size and quantization level. 4-bit quantization offers the best balance of quality and memory usage for most local deployments.
Hardware Guide
Match your hardware to the right model
E2B runs on phones and budget laptops. E4B fits comfortably on most laptops. The 26B MoE needs a decent GPU. The 31B Dense needs a workstation-class setup.
All memory figures are for model weights only. Add 2-4GB for context window (KV cache) depending on your use case.
Phone / Budget laptop
E2B (3.2-10GB)
4-bit: ~3.2GB | 8-bit: ~5-8GB | BF16: ~10GB. Runs on phones, Raspberry Pi, and budget hardware.
~95 tok/s on consumer GPUs. The fastest model in the family. Ideal for real-time applications.
Laptop / Desktop
E4B (5.5-16GB)
4-bit: ~5.5-6GB | 8-bit: ~9-12GB | BF16: ~16GB. Best edge model for everyday local use.
Good speed on RTX 3060+ or M1+ Macs. The recommended starting point for most local users.
GPU workstation
26B MoE (16-48GB)
4-bit: ~16GB | 8-bit: ~24GB | BF16: ~48GB. Near-31B quality on a single RTX 4090 or M4 Pro.
~2-8 tok/s depending on hardware. Best for batch processing and quality-critical local tasks.
Multi-GPU / Server
31B Dense (17-58GB)
4-bit: ~17GB | 8-bit: ~29GB | BF16: ~58GB. Maximum quality for local deployment.
Requires RTX 4090+ or M4 Max+ for comfortable use. Best for maximum quality without cloud dependency.
Deployment tools
Six ways to run Gemma 4 locally
From one-command Ollama setup to custom llama.cpp builds, there's a local deployment path for every skill level.
Ollama
One command to install, one command to run. The easiest path to local Gemma 4. HTTP API included for integration with other tools.
llama.cpp
Maximum control over quantization, context size, and GPU layers. Best for power users who want to tune every parameter.
MLX (Apple Silicon)
Optimized for M1/M2/M3/M4 Macs. Leverages unified memory for efficient inference on Apple hardware.
transformers (Python)
Full Hugging Face ecosystem integration. Best for Python developers who want to script, fine-tune, or build custom pipelines.
transformers.js (Browser)
Run E2B and E4B directly in Chrome with WebGPU. No installation, no server - just open a webpage.
LM Studio
GUI-based local model management. Download, configure, and chat with Gemma 4 through a desktop application.
Quick start
Get running in 2 minutes with Ollama
The fastest path from zero to local Gemma 4. Install Ollama, pull a model, start chatting.
Install & run
- Install: curl -fsSL https://ollama.com/install.sh | sh
- Run E4B: ollama run gemma4:e4b
- Run 26B: ollama run gemma4:26b
- Run 31B: ollama run gemma4:31b
- API: curl http://localhost:11434/api/generate -d '{...}'
Tips
- Start with E4B if you have 8-16GB RAM
- Use 4-bit quantization (Q4_K_M) for best quality/memory ratio
- Add --num-gpu-layers for GPU acceleration in llama.cpp
- Set context size based on your available memory
- Monitor VRAM usage - leave headroom for KV cache
Local performance
Real-world speed and quality on consumer hardware
Actual performance varies by hardware, quantization, and context length. Here's what to expect on common setups.
Local inference speed depends on your GPU, RAM, quantization level, and context length. These figures represent typical performance on common consumer hardware.


E2B at 4-bit: ~95 tok/s on RTX 3060, ~60 tok/s on M1 MacBook
E4B at 4-bit: ~40-60 tok/s on RTX 3060, ~30 tok/s on M1 MacBook
26B at 4-bit: ~8-15 tok/s on RTX 4090, ~5 tok/s on M4 Pro
31B at 4-bit: ~5-10 tok/s on RTX 4090, ~3 tok/s on M4 Max
Hardware requirements
VRAM and RAM requirements by quantization
Choose your quantization level based on available memory. 4-bit (Q4_K_M) offers the best quality-to-memory ratio for most users.
| Benchmark | E2B E2B | E4B E4B | 26B MoE 26B | 31B Dense 31B |
|---|---|---|---|---|
4-bit (Q4_K_M) Recommended | ~3.2 GB | ~5.5 GB | ~16 GB | ~17 GB |
8-bit (Q8_0) Higher quality | ~5-8 GB | ~9-12 GB | ~24 GB | ~29 GB |
BF16 / FP16 Full precision | ~10 GB | ~16 GB | ~48 GB | ~58 GB |
Min GPU Comfortable use | Any 4GB+ | RTX 3060+ | RTX 4090 | 2x RTX 4090 |
Apple Silicon Recommended Mac | Any M1+ | M1+ 16GB | M4 Pro 24GB | M4 Max 64GB |
Memory figures are for model weights only. Add 2-4GB for KV cache depending on context length.
Privacy First
Your data never leaves your device
Running Gemma 4 locally means complete privacy. No API calls, no data logging, no internet required after download. Process sensitive documents, code, and conversations with zero exposure.
- Zero data transmission - everything stays on your hardware
- No API keys, no accounts, no usage tracking
- Process confidential documents and proprietary code safely
Browser AI
Run Gemma 4 in your browser - no installation needed
The E2B and E4B models run directly in Chrome with WebGPU via transformers.js. No server, no installation, no configuration. Just open a webpage and start chatting.
- transformers.js enables in-browser inference with WebGPU
- E2B and E4B optimized for browser deployment
- Works in Chrome, Edge, and other WebGPU-capable browsers
Developer Tools
Integrate local Gemma 4 into your workflow
Use Gemma 4 as a local coding assistant with Claude Code, VS Code, or any tool that supports OpenAI-compatible APIs. Ollama and llama.cpp both expose compatible endpoints.
- OpenAI-compatible API via Ollama (localhost:11434)
- Works with Claude Code, Continue, Cursor, and other AI tools
- Fine-tune with TRL, Unsloth, or Keras for custom tasks
Quick start
Get Gemma 4 running locally
Choose your preferred tool and start in minutes.
Download weights
Get model files
Download official weights from trusted sources.
Advanced
Fine-tuning and customization
Customize Gemma 4 for your specific use case.
Local AI ecosystem
Tools and platforms for local Gemma 4
A growing ecosystem of tools makes running Gemma 4 locally easier than ever.
Get started
Run Gemma 4 on your hardware today
Try it online first, then download for private, offline use. No API keys, no accounts, no data leaves your device.