Gemma 4 on Apple Silicon: 85 tok/s, the Easy Way

Gemma 4 on Apple Silicon just got stupidly fast. One command, 85 tok/s, tools included – cloud services, take notes.

Gemma 4 Blasts 85 tok/s on Macs – Pip Install Only — theAIcatchup

Key Takeaways

  • Gemma 4 hits 85 tok/s on Apple Silicon with one pip install via Rapid-MLX.
  • Beats Ollama on decode speed, full tool calling for 18 model families.
  • OpenAI-compatible API works with LangChain, Aider, PydanticAI – offline agents unlocked.

Gemma 4 flies on Macs.

Last week, Google dropped their beefiest open-weight family yet. Hours later? It’s screaming at 85 tokens per second on my M3 Ultra Mac, with tool calling, streaming, and an OpenAI-compatible API ready for any framework you throw at it. And yeah, Gemma 4 on Apple Silicon is the phrase you’ll Google next.

pip install rapid-mlx
rapid-mlx serve gemma-4-26b

That’s it. No PhD required. Downloads a 14GB 4-bit MLX-quantized beast and fires up localhost:8000/v1. Boom.
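Once the server is up, any OpenAI-style client can hit it. Here’s a minimal stdlib-only sketch; the endpoint and `"default"` model name come from the article, while `build_payload` and `chat` are my own helper names, not part of Rapid-MLX:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Rapid-MLX's OpenAI-compatible endpoint

def build_payload(prompt, model="default"):
    """Assemble a /chat/completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, model="default"):
    """POST a single-turn chat request and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Explain MLX in one sentence.")  # needs the local server running
```

Swap `BASE_URL` for a cloud endpoint and nothing else changes – that’s the whole point of OpenAI compatibility.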

Look, I’ve benchmarked the hell out of this. Same M3 Ultra (192GB RAM), same Gemma 4 26B-A4B 4-bit model, same prompt.

Engine      Decode (tok/s)   TTFT    Notes
Rapid-MLX   85               0.26s   MLX-native, prompt cache
mlx-vlm     84               0.31s   VLM library (no tool calling)
Ollama      75               0.08s   llama.cpp backend

Rapid-MLX edges Ollama by 13% on decode – the part you actually feel chatting away. Ollama wins TTFT with its Metal prefill tricks, but who cares when decode drags? Smaller models? 168 tok/s on Qwen3.5-4B. Ollama limps at 70. That’s 2.4x, folks.
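If you want to re-run these numbers yourself, the figure that matters is decode throughput: tokens generated per second *after* the first token, which is why TTFT and decode can crown different winners. A tiny helper for the arithmetic (my own, not from any benchmark suite):

```python
def decode_tps(n_tokens, total_s, ttft_s):
    """Decode throughput: tokens generated per second after the first token."""
    return n_tokens / (total_s - ttft_s)

# e.g. 850 tokens in 10.26s wall time with a 0.26s TTFT -> 85.0 tok/s
```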

Why Local Gemma 4 Crushes Your Cloud Tab

Tool calling. Most local servers fake it or lock to one family. Rapid-MLX? 18 parsers baked in – Qwen, Gemma 4’s native <|tool_call> junk, GLM-4.7, Llama 3, you name it. No flags. Just works.

Here’s the curl magic:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'

It spits back perfect JSON args. Even Gemma’s quirky unquoted numbers like {a: 3, b: 4}. No regex hacks needed.
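Acting on that response client-side takes a few lines: pull the tool call out of the OpenAI-format message, parse its JSON argument string, and dispatch to your function. A sketch assuming the standard OpenAI response shape; `dispatch` and the registry are my own names:

```python
import json

def dispatch(response, registry):
    """Execute the first tool call in an OpenAI-format chat response."""
    call = response["choices"][0]["message"]["tool_calls"][0]
    fn = call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    return registry[fn["name"]](**args)

# Example with a canned response:
fake = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}}
]}}]}
print(dispatch(fake, {"get_weather": lambda city: f"checking {city}"}))
```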

OpenAI-compatible means frameworks swarm. PydanticAI agents? Check. LangChain tools? Streaming structured output? Multi-turn? All greenlit. Aider edits code like a boss.

But here’s my hot take – the one Google won’t tweet: this is MLX’s Manhattan Project moment for Apple Silicon. Remember TensorFlow democratizing GPUs in 2015? Local inference was a pipe dream for normies. Now? Pip install, and your MacBook beats cloud latency for real work. Prediction: within a year, devs ditch $20/month APIs for this. Edge AI just ate AWS’s lunch.

Tested the gamut.

Client      Status         Notes
PydanticAI  Tested (6/6)   Streaming, structured output, multi-tool
LangChain   Tested (6/6)   Tools, streaming, structured output
Aider       Tested         CLI edit-and-commit workflow

Every ‘Tested’ has repo scripts. Not vaporware.

Does Your Mac Have Enough Juice?

RAM rules all.

16GB Air? Qwen3.5-4B at 168 tok/s. Chatty enough.

32GB Pro? Gemma 4 26B-A4B, 85 tok/s. Tools galore.

64GB Mini? Qwen3.5-35B, 83 tok/s. Sweet spot.

96GB beasts? Qwen3.5-122B at 57 tok/s. Frontier stuff.
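The rule of thumb behind that ladder: a 4-bit quantized model needs roughly half a byte per parameter, plus headroom for KV cache and the OS. This back-of-the-envelope estimator is my own approximation, not a Rapid-MLX formula:

```python
def model_ram_gb(params_b, bits=4, overhead=1.25):
    """Approximate resident size of a quantized model in GB.

    params_b: parameter count in billions.
    overhead: fudge factor for KV cache, activations, runtime.
    """
    return params_b * bits / 8 * overhead

# model_ram_gb(26) -> 16.25, which is why the 26B model wants a 32GB machine
```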

Prompt cache is the secret sauce – caches your agent framework’s nagging system prompts. Follow-ups? 2-10x faster TTFT. OutputRouter? Token-level smarts splitting content, reasoning, tools. No post-processing slop.

Ollama fans, don’t cry. It’s solid. But Rapid-MLX laps it on Apple metal. Google’s open-weight play? Smart. They know closed models like GPT-5 hoard the good stuff. Gemma 4 teases power – and MLX delivers it locally.

Skeptical? rapid-mlx models lists ‘em. Docker for LibreChat, Open WebUI. Cursor, Continue.dev? Point and shoot.

Is Rapid-MLX Ollama’s Nightmare?

Short answer: Yes. For Mac users.

Ollama’s cross-platform king. But on Silicon? MLX owns the kernels. Decode speeds tell the tale. Tools? Broader out-the-box. And that pip? Chef’s kiss.

Corporate spin check: Google hypes Gemma 4 as ‘most capable open.’ Capable? Sure. But without Rapid-MLX, it’s warehouse-bound. Credit where due – this combo exposes cloud emperors naked.

Bottom line: your 32GB Mac now runs agentic workflows offline. No data leaks. No bills. Privacy win.

Downsides? Windows/Linux laggards wait. MLX is Apple-first. Fair.

Tinkerers, rejoice. This ain’t hype. It’s horsepower.

Why Ditch the Cloud Now?

Latency kills agents. Local? Sub-second loops. Tools fire instantly.
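Those loops are simple to write once tool calling is reliable: call the model, execute any tool calls, feed results back, stop when the model answers in plain text. A sketch of that loop; `model_fn` is my own abstraction (a callable returning an OpenAI-format assistant message) so the same loop works against any OpenAI-compatible client:

```python
import json

def agent_loop(model_fn, tools, messages, max_turns=5):
    """Run a tool-using agent loop until the model replies in plain text."""
    for _ in range(max_turns):
        msg = model_fn(messages)
        messages.append(msg)
        if not msg.get("tool_calls"):
            return msg["content"]  # final answer, no more tools requested
        for call in msg["tool_calls"]:
            fn = call["function"]
            result = tools[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id", ""),
                "content": str(result),
            })
    raise RuntimeError("agent did not finish within max_turns")
```

With sub-second local round-trips, each turn of this loop costs milliseconds of network overhead instead of a cloud RTT.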

The Aider workflow? It modified Python files smoothly. LangChain multi-tool chains? Flawless.

Historical parallel: CUDA in 2006. Nvidia won AI by owning hardware accel. Apple? MLX does that for Silicon. Bold call – expect forks, competitors, but Rapid-MLX sets the bar.

Enough geekery.


Frequently Asked Questions

How do I run Gemma 4 on Apple Silicon?

pip install rapid-mlx; rapid-mlx serve gemma-4-26b. API at localhost:8000/v1.

Gemma 4 vs Ollama: Which is faster on Mac?

Rapid-MLX wins decode (85 vs 75 tok/s on 26B). Tools broader.

Does Rapid-MLX support tool calling for Gemma 4?

Yes, native parser. Works with PydanticAI, LangChain, Aider out-of-box.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.



Originally reported by Dev.to
