525.5 tokens per second. That’s what MLX clocks on a Qwen3-0.6B model running 4-bit quantized on an M4 Max with 128GB RAM – an 87% leap over llama.cpp’s 281.5.
And here’s the kicker: this isn’t some lab toy. It’s your daily driver for code gen, writing, reasoning – right now, on a Mac Mini that fits in your backpack.
Apple Silicon’s turning the AI world upside down. Remember when GPUs democratized deep learning a decade ago? This feels like that, but cozier, more personal. Every Mac’s unified memory becomes a warp-speed inference engine. We’re talking tokens flying faster than cloud APIs can bill you per million.
Why Memory Bandwidth – Not Cores – Caps Your LLM Dreams
LLM generation? It’s a bandwidth hog. Each token demands slurping the whole model’s weights from memory. Compute sits idle, twiddling thumbs.
Simple math sets the roof: Max tok/s = Bandwidth (GB/s) ÷ Model size (GB). M4’s 120 GB/s means ~30 tok/s theoretical for a 4GB 7B Q4 model. M4 Max? 546 GB/s rockets that to 136.
Real life? 60-80% of theory after KV cache drama and kernel fuss. But zoom in: M4 Pro’s 273 GB/s – 2.3x base M4 – delivers 2x+ speed. GPU cores? Meh, base M4’s 10 vs Pro’s 16 barely nudges it.
| Chip | Bandwidth | 7B Q4 (~4GB) | 14B Q4 (~8GB) | 32B Q4 (~18GB) |
|---|---|---|---|---|
| M4 | 120 GB/s | ~30 tok/s | ~15 tok/s | ~6.7 tok/s |
| M4 Pro | 273 GB/s | ~68 tok/s | ~34 tok/s | ~15 tok/s |
| M4 Max | 546 GB/s | ~136 tok/s | ~68 tok/s | ~30 tok/s |
Quantize to Q4 from FP16? Boom – 4x throughput. Less data dragged around.
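The ceiling math above can be sketched in a few lines. The bandwidth figures are the ones from the table; the FP16-vs-Q4 sizes (2 bytes vs ~0.5 bytes per weight for a 7B model) are rough assumptions, and real throughput lands at the 60-80% of theory mentioned earlier.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical decode ceiling: every token reads all weights once,
    so max tok/s = bandwidth (GB/s) / model size (GB)."""
    return bandwidth_gb_s / model_size_gb

chips = {"M4": 120, "M4 Pro": 273, "M4 Max": 546}

# A 7B model: ~14 GB at FP16 (2 bytes/weight), ~4 GB at Q4 with overhead.
for name, bw in chips.items():
    fp16 = max_tokens_per_sec(bw, 14)
    q4 = max_tokens_per_sec(bw, 4)
    print(f"{name}: FP16 ~{fp16:.0f} tok/s, Q4 ~{q4:.0f} tok/s "
          f"({q4 / fp16:.1f}x from quantization)")
```

The ~3.5x here comes from Q4 files carrying scale metadata; a pure 16-bit-to-4-bit cut would be the clean 4x.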
My bold call – and this is the insight the benchmarks miss: Apple Silicon’s bandwidth blitz mirrors the PC revolution of the ’80s. Back then, IBM clones crushed mainframes for personal compute. Today? Your $1,500 Mac Mini slays $10k GPU rigs for solo AI workflows. Cloud’s toast for devs who hate latency bills.
Is MLX Really 87% Faster on Apple Silicon?
Grounded benchmarks don’t lie. On M4 Max:
| Model | Quant | MLX (tok/s) | llama.cpp (tok/s) | MLX Advantage |
|---|---|---|---|---|
| Qwen3-0.6B | 4-bit | 525.5 | 281.5 | +87% |
| Llama-3.2-1B | 4-bit | 461.9 | 331.3 | +39% |
| Qwen3-8B | 4-bit | 93.3 | 76.9 | +21% |
Pattern screams: MLX dominates under 14B params. At 27B? Tied – bandwidth wall hits both. 2026 shift: MLX matured from sketchy experiment to inference overlord.
But, plot twist: it’s prefill-greedy. Long inputs? MLX gulps 49 seconds upfront on an 8.5K-token context (Qwen3.5-35B-A3B, M1 Max), 94% of total time.
> The UI reported 51 tok/s for MLX generation, but effective wall-clock throughput was only 3 tok/s because 94% of time was spent prefilling.

(Source: famstack.dev)
Short prompts, long outputs? MLX wins big. RAG chats? llama.cpp sneaks ahead on TTFT.
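The nominal-vs-effective gap is worth making concrete. A quick sketch, plugging in the famstack.dev numbers above (49 s prefill, 51 tok/s decode) with an assumed ~160-token reply:

```python
def effective_tok_s(prefill_s: float, decode_tok_s: float, out_tokens: int) -> float:
    """Output tokens divided by total wall-clock time (prefill + decode)."""
    decode_s = out_tokens / decode_tok_s
    return out_tokens / (prefill_s + decode_s)

# 160 output tokens after a 49 s prefill, decoding at 51 tok/s:
print(f"{effective_tok_s(49, 51, 160):.1f} tok/s effective")
```

That works out to about 3 tok/s, matching the quoted wall-clock figure: the 51 tok/s decode rate barely matters when prefill eats 94% of the run.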
Ollama + MLX: 93% Decode Boost, Zero Sweat
Ollama 0.19+ flips the script. The MLX backend auto-kicks in on Macs with 32GB+ unified memory.
| Metric | llama.cpp (0.18) | MLX (0.19) | Improvement |
|---|---|---|---|
| Prefill | 1,147 tok/s | 1,804 tok/s | +57% |
| Decode | 57.8 tok/s | 111.4 tok/s | +93% |
| Total duration | 4.2s | 2.3s | -45% |
(Qwen3.5-35B-A3B on M4 Max 64GB. Source: DEV Community)
Just: `export OLLAMA_MLX=1`, then `ollama run qwen3:14b`. Done. No 8GB M4 sadness, though – it needs 32GB of unified memory.
vLLM-MLX? For serving hordes, 2-3.4x batch scaling. But solo? Ollama rules.
Why Does Quantization Matter – And What’s Q4_K_M?
Legacy Q4_0? Flat, clunky blocks: one scale per 32 weights. K-quants? Super-blocks of 256 values with quantized scales per chunk. Roughly a 3.3% quality dip versus FP16 for a ~75% size cut.
Sweet spot: Q4_K_M. Balances speed, smarts. Q5_K_M if you’re picky; Q6_K for near-FP16 fidelity.
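The core idea behind all the Q4 formats fits in a few lines. This is a minimal symmetric blockwise sketch, not the real Q4_K_M layout (which uses 256-value super-blocks with quantized per-chunk scales and mins):

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block: int = 32):
    """Split into blocks; store one float scale per block + 4-bit ints."""
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7  # map into [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).ravel()

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_q4(w)
err = np.abs(dequantize_q4(q, s) - w).max()
print(f"max abs error: {err:.4f}")
```

Per-block scales are what keep the quality dip small: one outlier only wrecks the precision of its own 32 neighbors, not the whole tensor. K-quants push this further by quantizing the scales themselves.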
On my 32GB M4 Mac Mini – weeks of tweaks – Qwen 3.5 9B daily shreds tasks. DeepSeek R1 Distill 14B reasons like a boss. Qwen 3.5 35B-A3B (MoE wizard). OpenAI gpt-oss-20B for flair.
Match to needs: Context hog? Slimmer wins. Quality freak? Higher K.
Top Picks for 32GB Macs – And Beyond
M4 Max 128GB? 70B Q4 at ~14 tok/s. Feasible.
But bottleneck’s bandwidth, always. Stack more RAM, yes – but Pro/Max chips first.
Here’s the wonder: This ecosystem – MLX, Ollama, K-quants – matured overnight. 2026 feels like AI’s personal computing dawn. No more pinging OpenAI’s servers. Your Mac, your rules, warp speed.
Prediction: By 2027, 80% of indie devs ditch cloud for Apple Silicon stacks. Hype? Nah – benchmarks scream it.
Frequently Asked Questions
What is MLX for Apple Silicon LLM inference?
Apple’s open-source array framework tuned for unified memory – it beats llama.cpp on models under 14B by optimizing Metal kernels for Apple Silicon’s memory bandwidth.
How do I enable MLX in Ollama?
Update to 0.19+, set `export OLLAMA_MLX=1`, then `ollama run [model]`. It auto-activates on Macs with 32GB+ unified memory.
Is MLX faster than llama.cpp on M4 Macs?
Yes, 20-87% for small/medium models (under 14B). Long prefill hurts TTFT; pick llama.cpp for RAG.