Gemma 4 21 tok/s on Ryzen Mini PC: llama.cpp Guide

Forget cloud LLMs. A $500 Ryzen mini PC cranks Gemma 4 at 21 tokens per second—locally. But it's a Vulkan-fueled headache that exposes local AI's dirty secrets.

Gemma 4 at 21 tok/s on Ryzen Mini PC: Vulkan's Messy Win — theAIcatchup

Key Takeaways

  • Ryzen mini PC with 96GB RAM hits 21 tok/s on Gemma 4 via llama.cpp Vulkan—no cloud needed.
  • Setup's messy: BIOS tweaks, compiles, OOM fights. Not for casuals.
  • Unique edge: Local APIs mimic OpenAI, perfect for VS Code/Copilot alternatives.

Local AI delusion.

That’s the vibe when you promise blazing chatbots on your desk. But here’s a Ryzen mini PC—Minisforum UM760 Slim, 96GB RAM, Radeon 760M iGPU—hitting ~21 tokens per second on Gemma 4. Not vaporware. Real Ubuntu sweat.

Hands-on guide based on a real setup: Ubuntu 24.04 LTS, AMD Radeon 760M (Ryzen iGPU), lots of RAM (e.g. 96 GiB), llama.cpp built with GGML_VULKAN, OpenAI-compatible API via llama-server, Open WebUI in Docker…

They nailed it. Or did they? This isn’t plug-and-play paradise. It’s Vulkan wrangling, OOM panics, and BIOS tweaks that’d make your grandma curse.

Why Chase 21 Tokens/Second on a Mini PC?

Cloud vendors laugh at you. $0.50 per million tokens? Sure, if you’re made of money. Local setup? Free after hardware. But expect 4 hours compiling llama.cpp. And that’s the fun part.

Picture this: You’re SSH’d into a headless Ubuntu box. wget –continue chugs a 15GB GGUF. Screen session, because who trusts flaky WiFi? Then llama-server –ngl 40 -c 4096. Boom—Open WebUI loads models like a boss. Chat spits coherent code. No latency tax.

Ryzen 5 7640HS isn’t a beast. Shared iGPU VRAM sips from system RAM. 96GB DDR5? Overkill for most, perfect here. Tokens fly at 21/s on Gemma 4 26B A4B Instruct. Q4_K_M quantized. Balance of speed and smarts.

But wait. Swap to Llama 3.1 8B Q8_0? Slower layers offload. Still usable. DeepSeek Coder? Same playbook. It’s a model zoo on your LAN.

Is Vulkan Worth the Build Hell?

Short answer: Yes. ROCm? AMD’s half-baked dream for pros. Vulkan via Mesa? Plug it into llama.cpp with -DGGML_VULKAN=1. glslc gripes on Ubuntu 24.04? Chase deps like a maniac.

Here’s the thing—iGPU acceleration without dedicated cards. No $2k Nvidia tax. But compile times? 30-60 minutes on this rig. Fresh clones, cmake ninja. Fail once, debug Vulkan ICD. Joy.

Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference.

Simpler? Tell that to the noob hitting “Failed to fetch models” in Open WebUI. Docker proxy woes. Ollama interference. It’s a rabbit hole.

Unique twist: This echoes 90s Linux tinkerers squeezing Quake off GeForce 256. Back then, proprietary drivers ruled. Now? Open-source Vulkan democratizes AI inference. Mini PCs as $500 servers—cloud’s nightmare.

The OOM Trap and RAM Lies

96GB sounds nuts. Truth? Gemma loads to system RAM, iGPU computes slices. Drop -ngl below 40, trim context to 4k tokens. Still OOM? Quant lower. Q2_K? Garbage output, zippy speed.

Systemd service hides the pain. ExecStart=/path/to/llama-server -m gemma.gguf –host 0.0.0.0 –port 8080 -ngl 40 -c 4096. User match critical—don’t daemon-privilege your way to segfaults.

Test it. llama-cli -m model.gguf -p “Write code” -n 128. Tokens/sec clock in. llama-bench for science. My run: 21.3 on Gemma 4. Laughable vs A100s. Heroic for a desk box.

Corporate spin? Google pushes Gemma as “lightweight.” Ha. 26B params need muscle. This setup calls bluff—runs uncensored, offline. No data leaks. PR teams hate it.

Open WebUI: Docker Savior or Port 3000 Headache?

Pull image. docker run -d -p 3000:8080 –add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://host.docker.internal:8080 –name open-webui –restart always ghcr.io/open-webui/open-webui:main.

Tweak for llama-server. No Ollama confusion. Models list? Check. Chat? Snappy. Browsing? Nope—hallucinations galore. That’s local limits, folks. Embrace the suck.

VS Code? Continue.dev or OpenCode plugin hits /v1/chat/completions. Same API. Code autocomplete on LAN. Dream for devs ditching Copilot subs.

Troubleshooting: Where Dreams Die

Vulkan blank? lspci | grep VGA. Mesa drivers? apt install mesa-vulkan-drivers vulkan-tools. glslc missing? Snap it or build.

Headless server? No desktop waste. RDP via xrdp if desperate. Ports: 8080 backend, 3000 UI. Firewall? ufw allow.

RAM tight? 32GB minimum, pray. 64GB sweet. Disk? NVMe rules.

Prediction: 2025 sees these mini PCs as dev team staples. Cloud GPUs for training only. Inference? Local Vulkan armies. AMD wins quiet.

But hype check—it’s not ChatGPT. Slower. Fiddly. Wrong answers confident as hell. Perfect for tinkerers. Normies? Stick to Poe.com.

Closing rant. This guide’s gold, but messy truth: Local AI’s punk rock. Raw, unreliable, rewarding. Cloud’s pop. Your call.

Why Does Local Gemma 4 Beat Cloud for Devs?

Cost. Privacy. Control. No vendor lock. API identical—swap endpoints, done. Devs hack faster sans rate limits.

Limits? Model size caps at RAM. No 405B here. But 70B? With 128GB upgrades, why not.

Will Ryzen Mini PCs Replace GPU Servers?

Not fully. Training? No. Inference? For chats, code? Absolutely crushing it.


🧬 Related Insights

Frequently Asked Questions

What hardware runs Gemma 4 at 21 tok/s?

Minisforum UM760 Slim (Ryzen 5 7640HS, Radeon 760M, 96GB RAM) on Ubuntu 24.04 with llama.cpp Vulkan build.

How to fix Open WebUI no models error?

Point OLLAMA_BASE_URL to http://host.docker.internal:8080. Ensure llama-server runs first. Match users, check ports.

Best quantization for speed vs quality on iGPU?

Q4_K_M. Balances 20+ tok/s with decent smarts. Q8_0 for quality, slower layers.

Aisha Patel
Written by

Former ML engineer turned writer. Covers computer vision and robotics with a practitioner perspective.

Frequently asked questions

What hardware runs Gemma 4 at 21 tok/s?
Minisforum UM760 Slim (Ryzen 5 7640HS, Radeon 760M, 96GB RAM) on Ubuntu 24.04 with llama.cpp Vulkan build.
How to fix Open WebUI no models error?
Point OLLAMA_BASE_URL to http://host.docker.internal:8080. Ensure llama-server runs first. Match users, check ports.
Best quantization for speed vs quality on iGPU?
Q4_K_M. Balances 20+ tok/s with decent smarts. Q8_0 for quality, slower layers.

Worth sharing?

Get the best AI stories of the week in your inbox — no noise, no spam.

Originally reported by dev.to

Stay in the loop

The week's most important stories from theAIcatchup, delivered once a week.