Gemma 4 flies on Macs.
Last week, Google dropped its beefiest open-weight family yet. Hours later? It's screaming along at 85 tokens per second on my M3 Ultra Mac: tool calling, streaming, an OpenAI-compatible API ready for any framework you throw at it. And yeah, Gemma 4 on Apple Silicon is the phrase you'll Google next.
```bash
pip install rapid-mlx
rapid-mlx serve gemma-4-26b
```
That’s it. No PhD required. Downloads a 14GB 4-bit MLX-quantized beast and fires up localhost:8000/v1. Boom.
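Prefer Python to curl? Here's a minimal sanity check with the official `openai` SDK, assuming the server keeps its default port and exposes the model under the `default` alias (the same names the curl example further down uses):

```python
# Minimal sketch: talk to the local Rapid-MLX server via the openai SDK.
# Assumes default port 8000 and the "default" model alias from this post.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```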
Look, I’ve benchmarked the hell out of this. Same M3 Ultra (192GB RAM), same Gemma 4 26B-A4B 4-bit model, same prompt.
| Engine | Decode (tok/s) | TTFT (s) | Notes |
|--------|----------------|----------|-------|
| Rapid-MLX | 85 | 0.26 | MLX-native, prompt cache |
| mlx-vlm | 84 | 0.31 | VLM library (no tool calling) |
| Ollama | 75 | 0.08 | llama.cpp backend |
Rapid-MLX edges Ollama by 13% on decode, the part you actually feel while chatting. Ollama wins TTFT with its Metal prefill tricks, but who cares when decode drags? Smaller models? 168 tok/s on Qwen3.5-4B. Ollama limps at 70. That's 2.4x, folks. Want to check my numbers? Crude timing harness below.
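A rough reproduction sketch, assuming streamed chunks map roughly one-to-one to tokens (true for most OpenAI-compatible servers; adjust if your server batches deltas):

```python
# Rough-and-ready benchmark: TTFT = time to first streamed chunk;
# decode rate counts chunks after that (~1 token per chunk, typically).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write 300 words about Mount Fuji."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1
end = time.perf_counter()

print(f"TTFT: {first - start:.2f}s")
print(f"Decode: {chunks / (end - first):.1f} chunks/s (~tok/s)")
```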
Why Local Gemma 4 Crushes Your Cloud Tab
Tool calling. Most local servers fake it or lock to one family. Rapid-MLX? 18 parsers baked in: Qwen, Gemma 4's native `<|tool_call>` junk, GLM-4.7, Llama 3, you name it. No flags. Just works.
Here’s the curl magic:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```
It spits back perfect JSON args. Even Gemma's quirky near-JSON with unquoted keys, like {a: 3, b: 4}, gets normalized. No regex hacks needed.
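Same request through the openai SDK, if you'd rather stay in Python. The server returns standard OpenAI-style `tool_calls`, so nothing here is Rapid-MLX-specific:

```python
# Read the parsed tool call back out. Assumes the same local server
# and "default" alias as the curl example above.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is the weather in Tokyo?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

call = resp.choices[0].message.tool_calls[0]
print(call.function.name)                   # e.g. get_weather
print(json.loads(call.function.arguments))  # e.g. {'city': 'Tokyo'}
```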
OpenAI-compatible means frameworks swarm. PydanticAI agents? Check. LangChain tools? Streaming structured output? Multi-turn? All greenlit. Aider edits code like a boss.
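Hooking LangChain up is about three lines. A sketch, assuming `langchain-openai` is installed and the same `default` alias:

```python
# Point LangChain's ChatOpenAI at the local Rapid-MLX endpoint.
# pip install langchain-openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
print(llm.invoke("One-sentence summary of MLX.").content)
```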
But here's my hot take, the one Google won't tweet: this is MLX's Manhattan Project moment for Apple Silicon. Remember TensorFlow democratizing GPU training in 2015? Local inference was a pipe dream for normies. Now? Pip install, and your MacBook outpaces cloud latencies for real work. Prediction: within a year, devs ditch $20/month APIs for this. Edge AI just ate AWS's lunch.
Tested the gamut.
| Client | Status | Notes |
|--------|--------|-------|
| PydanticAI | Tested (6/6) | Streaming, structured output, multi-tool |
| LangChain | Tested (6/6) | Tools, streaming, structured output |
| Aider | Tested | CLI edit-and-commit workflow |
Every ‘Tested’ has repo scripts. Not vaporware.
Does Your Mac Have Enough Juice?
RAM rules all.
- 16GB Air? Qwen3.5-4B at 168 tok/s. Chatty enough.
- 32GB Pro? Gemma 4 26B-A4B, 85 tok/s. Tools galore.
- 64GB Mini? Qwen3.5-35B, 83 tok/s. Sweet spot.
- 96GB beasts? Qwen3.5-122B at 57 tok/s. Frontier stuff.
Prompt cache is the secret sauce: it caches your agent framework's nagging system prompts, so follow-ups get 2-10x faster TTFT. Keep the system prompt byte-identical across turns and it kicks in for free, as in the sketch below. OutputRouter? Token-level smarts splitting content, reasoning, tools. No post-processing slop.
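A hedged illustration: the cache itself is server-side and automatic per the above, so the only client-side trick is not mutating the system prompt between turns:

```python
# Stream two turns that share an identical system prompt, so the server's
# prompt cache can reuse its prefill work on the second request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
SYSTEM = "You are a terse coding assistant."  # agent frameworks often send pages of this

for question in ["What is MLX?", "And who maintains it?"]:
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": question}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
```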
Ollama fans, don’t cry. It’s solid. But Rapid-MLX laps it on Apple metal. Google’s open-weight play? Smart. They know closed models like GPT-5 hoard the good stuff. Gemma 4 teases power – and MLX delivers it locally.
Skeptical? `rapid-mlx models` lists 'em. Docker setups for LibreChat and Open WebUI. Cursor, Continue.dev? Point and shoot.
Is Rapid-MLX Ollama’s Nightmare?
Short answer: Yes. For Mac users.
Ollama is the cross-platform king. But on Silicon? MLX owns the kernels. Decode speeds tell the tale. Tools? Broader out of the box. And that pip install? Chef's kiss.
Corporate spin check: Google hypes Gemma 4 as the 'most capable open' model. Capable? Sure. But without Rapid-MLX, it's warehouse-bound. Credit where due: this combo shows the cloud emperors have no clothes.
Wandered off? Back to it. Your 32GB Mac now runs agentic workflows offline. No data leaks. No bills. Privacy win.
Downsides? Windows and Linux folks wait. MLX is Apple-first. Fair.
Tinkerers, rejoice. This ain’t hype. It’s horsepower.
Why Are Developers Ditching the Cloud Now?
Latency kills agents. Local? Sub-second loops. Tools fire instantly.
Aider workflow? It modified Python files smoothly. LangChain multi-tool chains? Flawless. The whole agent loop fits in a screenful; sketch below.
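A minimal round-trip: the model asks for a tool, we run it locally, feed the result back, and get the final answer. The stubbed `get_weather` below is a stand-in for a real API; everything else is the standard OpenAI tool-calling dance:

```python
# Minimal agent loop against the local server: request -> tool call ->
# tool result -> final answer. get_weather is a hypothetical stub.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def get_weather(city: str) -> str:
    return f"Sunny and 22C in {city}"  # stand-in for a real weather API

tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]

messages = [{"role": "user", "content": "What is the weather in Tokyo?"}]
resp = client.chat.completions.create(model="default", messages=messages, tools=tools)
msg = resp.choices[0].message

while msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool request in history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": get_weather(**args)})
    resp = client.chat.completions.create(model="default", messages=messages, tools=tools)
    msg = resp.choices[0].message

print(msg.content)
```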
Historical parallel: CUDA in 2006. Nvidia won AI by owning hardware acceleration. Apple? MLX does that for Silicon. Bold call: expect forks and competitors, but Rapid-MLX sets the bar.
Enough geekery.
Frequently Asked Questions
How do I run Gemma 4 on Apple Silicon?
`pip install rapid-mlx`, then `rapid-mlx serve gemma-4-26b`. The API lives at `localhost:8000/v1`.
Gemma 4 vs Ollama: Which is faster on Mac?
Rapid-MLX wins decode (85 vs 75 tok/s on the 26B). Tool support is broader, too.
Does Rapid-MLX support tool calling for Gemma 4?
Yes, native parser. Works with PydanticAI, LangChain, and Aider out of the box.