Gemma 4 on Apple Silicon: 85 tok/s, the Easy Way

Gemma 4 on Apple Silicon just got stupidly fast. One command, 85 tok/s, tools included – cloud services, take notes.

Gemma 4 Blasts 85 tok/s on Macs – Pip Install Only — theAIcatchup

Key Takeaways

  • Gemma 4 hits 85 tok/s on Apple Silicon with one pip install via Rapid-MLX.
  • Beats Ollama on decode speed, full tool calling for 18 model families.
  • OpenAI-compatible API works with LangChain, Aider, PydanticAI – offline agents unlocked.

Gemma 4 flies on Macs.

Last week, Google dropped their beefiest open-weight family yet. Hours later? It’s screaming at 85 tokens per second on my M3 Ultra Mac, with tool calling, streaming, and an OpenAI-compatible API ready for any framework you throw at it. And yeah, Gemma 4 on Apple Silicon is the phrase you’ll Google next.

pip install rapid-mlx
rapid-mlx serve gemma-4-26b

That’s it. No PhD required. Downloads a 14GB 4-bit MLX-quantized beast and fires up localhost:8000/v1. Boom.
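Once the server is up, any OpenAI-style client can hit it. Here’s a minimal stdlib-only sketch; the endpoint and `"default"` model name come from the article, while `build_payload` and `chat` are my own helper names, not part of Rapid-MLX:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Rapid-MLX's OpenAI-compatible endpoint

def build_payload(prompt, model="default"):
    """Assemble a /chat/completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, model="default"):
    """POST a single-turn chat request and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Explain MLX in one sentence.")  # needs the local server running
```

Swap `BASE_URL` for a cloud endpoint and nothing else changes – that’s the whole point of OpenAI compatibility.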

Look, I’ve benchmarked the hell out of this. Same M3 Ultra (192GB RAM), same Gemma 4 26B-A4B 4-bit model, same prompt.

Engine      Decode (tok/s)   TTFT    Notes
Rapid-MLX   85               0.26s   MLX-native, prompt cache
mlx-vlm     84               0.31s   VLM library (no tool calling)
Ollama      75               0.08s   llama.cpp backend

Rapid-MLX edges Ollama by 13% on decode – the part you actually feel chatting away. Ollama wins TTFT with its Metal prefill tricks, but who cares when decode drags? Smaller models? 168 tok/s on Qwen3.5-4B. Ollama limps at 70. That’s 2.4x, folks.
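If you want to re-run these numbers yourself, the figure that matters is decode throughput: tokens generated per second *after* the first token, which is why TTFT and decode can crown different winners. A tiny helper for the arithmetic (my own, not from any benchmark suite):

```python
def decode_tps(n_tokens, total_s, ttft_s):
    """Decode throughput: tokens generated per second after the first token."""
    return n_tokens / (total_s - ttft_s)

# e.g. 850 tokens in 10.26s wall time with a 0.26s TTFT -> 85.0 tok/s
```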

Why Local Gemma 4 Crushes Your Cloud Tab

Tool calling. Most local servers fake it or lock to one family. Rapid-MLX? 18 parsers baked in – Qwen, Gemma 4’s native <|tool_call> junk, GLM-4.7, Llama 3, you name it. No flags. Just works.

Here’s the curl magic:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

It spits back perfect JSON args. Even Gemma’s quirky unquoted numbers like {a: 3, b: 4}. No regex hacks needed.
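Acting on that response client-side takes a few lines: pull the tool call out of the OpenAI-format message, parse its JSON argument string, and dispatch to your function. A sketch assuming the standard OpenAI response shape; `dispatch` and the registry are my own names:

```python
import json

def dispatch(response, registry):
    """Execute the first tool call in an OpenAI-format chat response."""
    call = response["choices"][0]["message"]["tool_calls"][0]
    fn = call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    return registry[fn["name"]](**args)

# Example with a canned response:
fake = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}}
]}}]}
print(dispatch(fake, {"get_weather": lambda city: f"checking {city}"}))
```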

OpenAI-compatible means frameworks swarm. PydanticAI agents? Check. LangChain tools? Streaming structured output? Multi-turn? All greenlit. Aider edits code like a boss.

But here’s my hot take – the one Google won’t tweet: this is MLX’s Manhattan Project moment for Apple Silicon. Remember TensorFlow democratizing GPUs in 2015? Local inference was a pipe dream for normies. Now? Pip install, and your MacBook beats cloud latency for real work. Prediction: within a year, devs ditch $20/month APIs for this. Edge AI just ate AWS’s lunch.

Tested the gamut.

Client      Status         Notes
PydanticAI  Tested (6/6)   Streaming, structured output, multi-tool
LangChain   Tested (6/6)   Tools, streaming, structured output
Aider       Tested         CLI edit-and-commit workflow

Every ‘Tested’ has repo scripts. Not vaporware.

Does Your Mac Have Enough Juice?

RAM rules all.

16GB Air? Qwen3.5-4B at 168 tok/s. Chatty enough.

32GB Pro? Gemma 4 26B-A4B, 85 tok/s. Tools galore.

64GB Mini? Qwen3.5-35B, 83 tok/s. Sweet spot.

96GB beasts? Qwen3.5-122B at 57 tok/s. Frontier stuff.
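The rule of thumb behind that ladder: a 4-bit quantized model needs roughly half a byte per parameter, plus headroom for KV cache and the OS. This back-of-the-envelope estimator is my own approximation, not a Rapid-MLX formula:

```python
def model_ram_gb(params_b, bits=4, overhead=1.25):
    """Approximate resident size of a quantized model in GB.

    params_b: parameter count in billions.
    overhead: fudge factor for KV cache, activations, runtime.
    """
    return params_b * bits / 8 * overhead

# model_ram_gb(26) -> 16.25, which is why the 26B model wants a 32GB machine
```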

Prompt cache is the secret sauce – caches your agent framework’s nagging system prompts. Follow-ups? 2-10x faster TTFT. OutputRouter? Token-level smarts splitting content, reasoning, tools. No post-processing slop.

Ollama fans, don’t cry. It’s solid. But Rapid-MLX laps it on Apple metal. Google’s open-weight play? Smart. They know closed models like GPT-5 hoard the good stuff. Gemma 4 teases power – and MLX delivers it locally.

Skeptical? rapid-mlx models lists ‘em. Docker for LibreChat, Open WebUI. Cursor, Continue.dev? Point and shoot.

Is Rapid-MLX Ollama’s Nightmare?

Short answer: Yes. For Mac users.

Ollama’s cross-platform king. But on Silicon? MLX owns the kernels. Decode speeds tell the tale. Tools? Broader out-the-box. And that pip? Chef’s kiss.

Corporate spin check: Google hypes Gemma 4 as ‘most capable open.’ Capable? Sure. But without Rapid-MLX, it’s warehouse-bound. Credit where due – this combo exposes cloud emperors naked.

Bottom line: your 32GB Mac now runs agentic workflows offline. No data leaks. No bills. Privacy win.

Downsides? Windows/Linux laggards wait. MLX is Apple-first. Fair.

Tinkerers, rejoice. This ain’t hype. It’s horsepower.

Why Ditch the Cloud Now?

Latency kills agents. Local? Sub-second loops. Tools fire instantly.
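Those loops are simple to write once tool calling is reliable: call the model, execute any tool calls, feed results back, stop when the model answers in plain text. A sketch of that loop; `model_fn` is my own abstraction (a callable returning an OpenAI-format assistant message) so the same loop works against any OpenAI-compatible client:

```python
import json

def agent_loop(model_fn, tools, messages, max_turns=5):
    """Run a tool-using agent loop until the model replies in plain text."""
    for _ in range(max_turns):
        msg = model_fn(messages)
        messages.append(msg)
        if not msg.get("tool_calls"):
            return msg["content"]  # final answer, no more tools requested
        for call in msg["tool_calls"]:
            fn = call["function"]
            result = tools[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id", ""),
                "content": str(result),
            })
    raise RuntimeError("agent did not finish within max_turns")
```

With sub-second local round-trips, each turn of this loop costs milliseconds of network overhead instead of a cloud RTT.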

The Aider workflow? It modified Python files smoothly. LangChain multi-tool chains? Flawless.

Historical parallel: CUDA in 2006. Nvidia won AI by owning hardware accel. Apple? MLX does that for Silicon. Bold call – expect forks, competitors, but Rapid-MLX sets the bar.

Enough geekery.


Frequently Asked Questions

How do I run Gemma 4 on Apple Silicon?

pip install rapid-mlx; rapid-mlx serve gemma-4-26b. API at localhost:8000/v1.

Gemma 4 vs Ollama: Which is faster on Mac?

Rapid-MLX wins decode (85 vs 75 tok/s on 26B). Tools broader.

Does Rapid-MLX support tool calling for Gemma 4?

Yes, native parser. Works with PydanticAI, LangChain, Aider out-of-box.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.



Originally reported by Dev.to
