The error message appeared at 2 AM: unknown model architecture: 'gemma4'. Google had released Gemma 4 the day before, and the standard llama.cpp container couldn’t load it. Most people would wait for the official image. This engineer didn’t.
Two hours later, the model was live on a home Kubernetes cluster, running inference at 96 tokens per second, and fixing actual bugs pulled from a real codebase. Not benchmark numbers. Not toy examples. Production-grade code generation.
This is the story behind that two-hour build, and what it tells us about how the infrastructure around open-source AI models has fundamentally shifted.
What Actually Happened
The setup was modest: two RTX 5060 Ti GPUs (16GB each), an AMD Ryzen 9, 64GB of RAM, and MicroK8s running on Ubuntu. Call it a high-end gaming PC dressed up as a data center. The builder—someone who’d already written LLMKube, a custom Kubernetes operator for llama.cpp inference—recognized the architecture problem immediately.
Gemma 4 support wasn’t in the released llama.cpp binaries yet. It was only in HEAD. So instead of waiting, they did what the infrastructure enables you to do now: they built it themselves.
A Dockerfile. A Kaniko build job running on the same Kubernetes cluster. Fifteen minutes of compilation targeting CUDA SM 86 and SM 120 (Ampere and Blackwell). A push to a local container registry. No external CI, no waiting for GitHub Actions—the cluster built its own inference server while running inference jobs.
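For readers who want to reproduce the idea rather than the exact build, the sketch below shows the general shape of such a two-stage image, assuming llama.cpp’s current CMake options (GGML_CUDA, CMAKE_CUDA_ARCHITECTURES) and CUDA 12.8 base images. The engineer’s actual Dockerfile isn’t published here, so every concrete value below (image tags, paths, flags) is an assumption, not their build.

```dockerfile
# Sketch of a two-stage llama.cpp build for Ampere (SM 86) and Blackwell (SM 120).
# Base image tags and flags are assumptions; adjust to your cluster.
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04 AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
        git cmake build-essential libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*
# Build from HEAD, since the released binaries lack the new architecture.
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp /src
WORKDIR /src
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;120" \
        -DBUILD_SHARED_LIBS=OFF && \
    cmake --build build --config Release -j"$(nproc)" --target llama-server

# Slim runtime stage: only the server binary plus its runtime dependencies.
FROM nvidia/cuda:12.8.0-runtime-ubuntu24.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        libcurl4 libgomp1 && rm -rf /var/lib/apt/lists/*
COPY --from=build /src/build/bin/llama-server /usr/local/bin/llama-server
ENTRYPOINT ["/usr/local/bin/llama-server"]
```

The Kaniko job then just points at a Dockerfile like this and pushes the result to the local registry.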
“The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request was about 3 minutes.”
That’s the speed we’re talking about.
The Performance Numbers Actually Matter
Sure, Gemma 4 26B generated text at 96 tokens per second on a single request. Under concurrent load (110 requests total), it sustained 170 tokens per second in aggregate. The official benchmarks floating around claim about 40 tok/s on consumer hardware. That is not a marginal improvement; on single-request throughput alone, it is a 2.4x gap.
The mixture-of-experts (MoE) architecture is why. Only 4 billion parameters activate per token, even though the model totals 26 billion. Q4_K_M quantization brings the weights down to 15.6 GB. Split across the two cards, that leaves roughly 16 GB of the combined 32 GB of VRAM for KV cache at a 32K context length. Tensor parallelism works. The math works.
But numbers from synthetic benchmarks don’t prove anything. Real code does.
Three Tests That Actually Mattered
The engineer fed Gemma 4 three actual problems from their own infrastructure project.
Bug 1: Kubernetes deadlock. GPU workloads were getting stuck during rolling updates—new pods couldn’t schedule because old pods held the GPUs, and old pods wouldn’t terminate because they were waiting for new pods to become ready. A classic resource contention deadlock.
Gemma 4’s response: it correctly identified that Kubernetes GPU workloads should use the Recreate strategy instead of RollingUpdate, gated on a check of the requested GPU count. It showed its reasoning. It considered edge cases. It verified the pattern before committing to an answer.
Time to a working solution: 10.6 seconds.
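The fix itself is small. As a rough reconstruction (not the model’s verbatim output), the conditional in an operator’s deployment builder could look like this; the package and function names are hypothetical, not LLMKube’s actual API:

```go
// Sketch of the deployment-strategy fix described above. Names are illustrative.
package controller

import (
	appsv1 "k8s.io/api/apps/v1"
)

// buildStrategy picks Recreate for GPU-backed workloads so the old pod releases
// its GPUs before the replacement tries to schedule, avoiding the rolling-update
// deadlock. CPU-only workloads keep the zero-downtime RollingUpdate default.
func buildStrategy(gpuCount int64) appsv1.DeploymentStrategy {
	if gpuCount > 0 {
		return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	}
	return appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
}
```

Recreate trades a brief availability gap for a guarantee that the GPUs are free before the new pod schedules, which is exactly the trade-off this deadlock calls for.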
Bug 2: Orphaned endpoints. Deleting an InferenceService was leaving zombie Endpoints in the cluster—DNS records pointing nowhere, cleanup logic missing.
Gemma 4 wrote an UnregisterEndpoint method in Go. Complete. DNS name sanitization. Service and Endpoints deletion. NotFound error handling. Logging. Production-quality on the first try. No hallucinations. No corrections needed.
11.1 seconds.
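To make that description concrete, here is a sketch of what a method with those properties can look like in a controller-runtime-based operator. The receiver type, field names, and sanitization rule are assumptions for illustration; this is not the code Gemma 4 produced.

```go
// Sketch of an UnregisterEndpoint method: DNS sanitization, Service and
// Endpoints deletion, NotFound tolerance, and logging. Names are hypothetical.
package registry

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
)

type EndpointRegistry struct {
	Client client.Client
}

// sanitizeDNSName lowercases the name and replaces characters that are not
// valid in a DNS-1123 label, mirroring the sanitization used at registration.
func sanitizeDNSName(name string) string {
	s := strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
			return r
		}
		return '-'
	}, strings.ToLower(name))
	return strings.Trim(s, "-")
}

// UnregisterEndpoint deletes the Service and Endpoints created for an
// InferenceService, tolerating objects that are already gone.
func (r *EndpointRegistry) UnregisterEndpoint(ctx context.Context, name, namespace string) error {
	log := logf.FromContext(ctx)
	key := types.NamespacedName{Name: sanitizeDNSName(name), Namespace: namespace}

	svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{Name: key.Name, Namespace: key.Namespace}}
	if err := r.Client.Delete(ctx, svc); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Service %s: %w", key, err)
	}

	ep := &corev1.Endpoints{ObjectMeta: metav1.ObjectMeta{Name: key.Name, Namespace: key.Namespace}}
	if err := r.Client.Delete(ctx, ep); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting Endpoints %s: %w", key, err)
	}

	log.Info("unregistered inference endpoint", "name", key.Name, "namespace", key.Namespace)
	return nil
}
```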
Bug 3: Test cases. The engineer asked it to write tests following an existing pattern in the codebase. Four correct test cases. BeforeEach setup. Proper assertions. The exact Gomega matchers already in use: ContainElements for presence checks, NotTo(ContainElement()) for absence checks. It matched the codebase’s own conventions, not generic Go patterns.
12.3 seconds.
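For readers unfamiliar with that pattern, the sketch below shows the Ginkgo/Gomega shape being described: BeforeEach setup, ContainElements for presence, NotTo(ContainElement()) for absence. The function under test and the flags it emits are hypothetical stand-ins, not LLMKube code or the tests the model wrote.

```go
// Illustration of the test pattern described above. buildServerArgs and the
// flag names are hypothetical; only the matcher usage mirrors the text.
package controller_test

import (
	"fmt"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// buildServerArgs stands in for the real code under test.
func buildServerArgs(gpus, contextLen int) []string {
	args := []string{"--ctx-size", fmt.Sprint(contextLen)}
	if gpus > 1 {
		args = append(args, "--split-mode", "layer")
	}
	return args
}

func TestServerArgs(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Server Args Suite")
}

var _ = Describe("server args", func() {
	var args []string

	BeforeEach(func() {
		args = buildServerArgs(2, 32768)
	})

	It("sets the requested context length", func() {
		Expect(args).To(ContainElements("--ctx-size", "32768"))
	})

	It("splits layers across GPUs when more than one is available", func() {
		Expect(args).To(ContainElements("--split-mode", "layer"))
	})

	It("does not emit CPU-only flags for GPU deployments", func() {
		// "--no-gpu" is a hypothetical flag used purely to show the absence check.
		Expect(args).NotTo(ContainElement("--no-gpu"))
	})
})
```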
Why This Matters (And Why It Doesn’t)
Listen, Gemma 4 doesn’t replace Claude or GPT-4. On genuinely complex multi-step reasoning problems, the thinking is shallower. It’ll occasionally hit the token limit and cut off mid-response. It’s not better than the frontier models for novel, ambiguous problems.
But that’s not the real insight here.
The architectural shift is about friction. Three years ago, deploying a new open-source model meant waiting weeks for community builds, fighting with Docker image compatibility, wrestling with CUDA versions, probably downgrading your GPU driver to something that actually shipped with the right libraries.
Now? Write a Dockerfile, run it on the same cluster that serves inference. Three minutes from deploy command to live requests.
That compressed timeline changes what’s possible. It means you’re not locked into whatever foundation model shipped six months ago with your hardware. You can run experiments at model-release velocity. You can test hypotheses in hours instead of planning sprints around them.
And when a new model drops that’s designed for your specific problem—smaller, faster, more efficient—you’re not stuck with migration plans and retraining. You spin it up while the hype dies down and evaluate it yourself, on your hardware, with your data.
That’s not revolutionary. It’s infrastructural. It’s the boring, obvious end state that only looks revolutionary because we’ve been stuck in the friction regime for so long.
The Missing Piece
One thing worth noting: this engineer built LLMKube themselves. Not everyone has that. A Kubernetes operator that handles model downloads, GPU allocation, health probes, and OpenAI-compatible endpoints is… really useful. And it’s not a standard tool in the ecosystem yet.
There are open-source alternatives (vLLM, TensorRT-LLM, LocalAI), but they solve different problems or require different infrastructure. The fact that someone had to write a custom operator to do what seems like a basic task—“download model, allocate GPU, serve it with an API”—suggests the tooling is still fragmented.
That fragmentation is the real throttle on adoption.
FAQs
How fast is Gemma 4 compared to other open-source models? At Q4_K_M quantization on dual RTX 5060 Ti GPUs, it does 96 tok/s single-request and 170 tok/s aggregate throughput. For reference, Llama 3.1 70B on similar hardware (single GPU) does ~30 tok/s. The MoE design (only 4B active parameters per token) is the efficiency driver.
Can I deploy Gemma 4 without building from source? Not yet. At the time of writing, the official llama.cpp releases don’t include Gemma 4 architecture support, so you either build from HEAD or wait for the next release. This is temporary; expect official images within weeks.
Does Gemma 4 actually generate production-quality code? On the three real bugs tested, yes—it generated Kubernetes-correct YAML, production Go code with error handling, and test cases matching existing conventions. But it’s not a replacement for GPT-4 on novel multi-step problems. Think of it as strong for single-issue bug fixes and code generation within known patterns.