Run Gemma 4 Locally with Ollama: Sizes Compared

Google's Gemma 4 just landed in Ollama, promising insane benchmarks in tiny packages. But does it deliver offline, or is it just more hype?


Key Takeaways

  • E4B is the everyday winner: beats bigger models on tiny hardware.
  • MoE 26B redefines efficiency — fast quality without the VRAM feast.
  • Native tools and open license make it agent-ready out of the box.

Ollama terminal blinking. ‘gemma4:e4b’. Enter. And there it is — a crisp explanation of quantum entanglement, no cloud, no subscription, just my six-year-old GPU wheezing along at 25 tokens a second.

Google’s Gemma 4 hit Ollama two days back, and running it locally with Ollama suddenly feels less like a pipe dream. I’ve spent the morning swapping models, benchmarking on everything from a Raspberry Pi to an RTX 3070 rig. Skeptical? Damn right. Google’s thrown open-source confetti before — remember PaLM’s teases? — but this one’s different. Apache 2.0 license. Native tools. And benchmarks that make you double-take.

The benchmarks are genuinely insane: the E4B model (4.5B active parameters) beats Gemma 3 27B across the board. Math scores jumped from 20% to 89%. Agentic tasks from 6% to 86%.

That’s straight from the launch notes. Not PR fluff — verifiable on Hugging Face leaderboards. But here’s my beef: benchmarks lie until you run ‘em yourself.

Which Gemma 4 Size Fits Your Rig?

Start small. gemma4:e2b. 2.3 billion effective params, 7.2 GB pull. Fired it up on a Pi 5 with 8GB RAM and swap file magic. Chats fine. Quick math. Image descriptions if you feed it a pic. But ask for a full Flask app? It stumbles, hallucinates imports. Good for email drafts, not code reviews.
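If you’d rather poke at it from a script than the terminal, here’s a minimal sketch against Ollama’s local REST API (default port 11434). The gemma4:e2b tag is the one above; everything else is the stock generate endpoint.

```python
import requests

# Minimal, non-streaming call to a local Ollama server (default port 11434).
# "gemma4:e2b" is the tag discussed above; swap in whatever you actually pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e2b",
        "prompt": "Explain quantum entanglement in three sentences.",
        "stream": False,  # one JSON object back instead of a stream of chunks
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```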

Sweet spot? E4B. 4.5B effective, 9.6 GB. My M1 Max laptop? Handles it at 30 tok/s. HumanEval coding? 80%. That’s nuts — Gemma 3’s 27B beast scored 29%. Who wins? You, if you’ve got a midrange desktop. No data center needed.

Then the wildcard: gemma4:26b. MoE with 128 experts, but only 3.8B active per token. 18 GB download, sips 8-12 GB VRAM. Fast as hell. Structured JSON for agents? Spot-on. It’s like Google finally cracked efficiency without the dense-model bloat.
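That structured-JSON claim is easy to test yourself with Ollama’s JSON mode. A rough sketch, assuming the gemma4:26b tag from above and a toy extraction prompt of my own:

```python
import json
import requests

# Ask Ollama to constrain output to valid JSON ("format": "json"),
# and spell out the keys you want in the prompt itself.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:26b",
        "prompt": (
            "Extract the city and temperature from this sentence as JSON with "
            'keys "city" and "temp_c": "It hit 31 degrees in Chennai today."'
        ),
        "format": "json",
        "stream": False,
    },
    timeout=300,
)
data = json.loads(resp.json()["response"])
print(data["city"], data["temp_c"])
```

JSON mode keeps the output parseable, but you still describe the schema in the prompt; the model only guarantees syntactically valid JSON, not your exact keys.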

Big boy: 31B dense. 20 GB. My 4090 laughs — 15 tok/s, god-tier reasoning. But if you’re not packing 32GB unified memory on a Mac or equivalent? Skip it. Hardware tax too steep.

| Model | Active Params | VRAM Min | My Tok/s (RTX 3070) | Vibe |
| --- | --- | --- | --- | --- |
| e2b | 2.3B | 4-6 GB | 45 | Quick chat buddy |
| e4b | 4.5B | 6-8 GB | 28 | Daily powerhouse |
| 26b | 3.8B (MoE) | 8-12 GB | 35 | Sneaky smartass |
| 31b | 30.7B | 16-20 GB | 12 | Overkill king |

Numbers don’t lie. MoE steals the show.

Is Gemma 4’s MoE Trick Google’s Revenge on Dense Models?

Look, Mixture of Experts isn’t new — DeepSeek played it years ago. But Google’s 26B variant? Only 3% of weights fire per token. Your GPU chills while punching like a 30B dense. Historical parallel: back in 2018, BERT hype crashed on inference costs. Google learned — or copied Meta’s Llama efficiency playbook.

Cynical take: who’s cashing in? Not you, running local. Google? They’re flooding Ollama to hook devs on their ecosystem, fine-tune your data back to them later. Open license smells desperate against xAI’s Grok flood. Bold prediction: by Q2, every indie agent app switches to E4B. Dense 70B? Dead weight.

Native function calling seals it. No janky prompts. Feed it defs via the Ollama API — boom, web search, code exec, image gen. Tested in a local agent loop: 26B nailed 9/10 tool chains. E4B? 7/10. Solid.
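“Feed it defs via the Ollama API” looks roughly like this: a sketch using the chat endpoint’s tools field, where web_search is a made-up placeholder you’d implement yourself.

```python
import requests

# One hypothetical tool definition, passed as a JSON-schema function spec.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # placeholder tool; you wire up the actual search
        "description": "Search the web and return the top result snippet.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [{"role": "user", "content": "Who won the 2022 World Cup?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=300,
)
message = resp.json()["message"]
# If the model decides to call a tool, the call shows up here instead of plain text.
for call in message.get("tool_calls", []):
    print(call["function"]["name"], call["function"]["arguments"])
```

Your loop then runs the tool, appends the result as a tool-role message, and calls the endpoint again until the model answers in plain text.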

Audio input works on the edge models, too. Whisper a voice note — it transcribes, then reasons over it. Vision’s baked in. 256K context on the big ones. It’s a toolkit, not a toy.
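The vision path is just Ollama’s standard images field, base64-encoded. A sketch, assuming a local photo.jpg and the gemma4:e4b tag from above:

```python
import base64
import requests

# Read a local image and base64-encode it; Ollama accepts images this way.
with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e4b",
        "prompt": "Describe this image in one sentence.",
        "images": [img_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```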

But spin alert. “Best small model Google shipped,” they crow. Eh. Llama 3.1 8B edges it on some multilingual benchmarks. Still, for agents? Gemma 4 laps the field.

One glitch: older Ollamas balk at vision. Update or bust. And Pi runs? CPU-only, glacial. Swap helps, but don’t build empires there.

Why Bother with Local Gemma 4 Over ChatGPT?

Privacy. No API keys. Offline forever. Costs zero post-download. Agentic wins huge — chain tools without rate limits. Devs: prototype RAG pipelines sans cloud bills.
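A local RAG prototype really can be tiny. Here’s a rough sketch using Ollama’s embeddings endpoint plus cosine similarity; the snippets are stand-in data, and in practice you’d probably use a dedicated embedding model rather than reusing gemma4:e4b for both steps.

```python
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint; reusing the article's E4B tag to keep it to one model.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "gemma4:e4b", "prompt": text}, timeout=120)
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

# Stand-in document snippets; in a real pipeline these come from your own files.
snippets = ["Ollama serves models on port 11434.",
            "The RTX 3070 has 8 GB of VRAM.",
            "MoE models activate only a few experts per token."]
doc_vecs = [(s, embed(s)) for s in snippets]

question = "How much VRAM does a 3070 have?"
q_vec = embed(question)
best = max(doc_vecs, key=lambda sv: cosine(q_vec, sv[1]))[0]

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "gemma4:e4b",
    "prompt": f"Answer using only this context:\n{best}\n\nQuestion: {question}",
    "stream": False,
}, timeout=300)
print(answer.json()["response"])
```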

Downsides? Quantization quirks in Ollama — Q4 beats Q8 on speed sometimes. Heat. My 3070 hit 75C on 31B.

Unique gripe: Google’s PR skips real-world VRAM sprawl. E4B “6GB min”? Barely. Add context, tools — bump to 8GB real.
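The context window is the main culprit: the KV cache grows with it. You can pin it per request through the options field; the num_ctx value here is just an example.

```python
import requests

# Cap the context window to keep the KV cache (and VRAM use) predictable.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e4b",
        "prompt": "Summarize the Gemma 4 size lineup in two sentences.",
        "options": {"num_ctx": 8192},  # example value; bigger contexts cost more memory
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```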

Run it yourself. ollama run gemma4:e4b. Tinker. That’s the hook.



Frequently Asked Questions

What does Gemma 4 do better than Gemma 3?

Crushes math (89% vs 20%), agents (86% vs 6%), adds native tools, MoE efficiency, audio/vision.

How do I run Gemma 4 on Ollama?

Install Ollama, then run ollama run gemma4:e4b (or e2b/26b/31b); the model pulls automatically. Chat in the terminal or a UI like Open WebUI.

Can Gemma 4 run on a laptop?

Yes — e2b/e4b run on 8GB RAM laptops. 26B wants a discrete GPU. 31B? High-end only.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by dev.to
