A platform engineer at a mid-sized startup just got pinged in Slack: “Can we run our LLM on Kubernetes?”
Six months ago, that question would’ve drawn nervous laughter. Today, it’s the default. And the answer is yes—but not in the way most people think.
The real story isn’t about whether Kubernetes can run AI workloads. It’s that Kubernetes has quietly evolved into the operating system for AI engineering in production, and the cloud native ecosystem has been systematically building the infrastructure layer that bridges the gap between a working model and a reliable system.
The AI-Infrastructure Translation Problem
Here’s the friction: AI engineers think in terms of model accuracy, latency, and token throughput. Infrastructure engineers think in terms of resource utilization, availability, and cost. They’re speaking different languages, deploying on the same platform, and nobody’s happy.
That gap is the real problem—not Kubernetes itself.
According to the CNCF and SlashData State of Cloud Native Development report, only 41% of professional AI developers identify as cloud native. Most came up through data science backgrounds where Jupyter notebooks and managed environments handled the operational heavy lifting. Meanwhile, platform engineers see GPU-hungry, stateful, long-running workloads and think: “This isn’t what Kubernetes was designed for.”
Both are right. Both are missing the point.
What Changed (And Why It Matters Now)
The cloud native ecosystem didn’t rewrite Kubernetes for AI. Instead, it built a coherent stack of patterns and projects that map AI engineering problems onto capabilities Kubernetes has been developing for years.
Start with scheduling. In 2025, Dynamic Resource Allocation (DRA) reached general availability in Kubernetes 1.34. This is the unsexy but critical stuff. DRA replaces the crude limitations of device plugins with topology-aware GPU scheduling using declarative ResourceClaims and CEL-based filtering. If you’ve ever watched GPU clusters sit at 40% utilization because scheduling was dumb, you understand why this matters.
“DRA is a significant step forward for teams managing GPU clusters,” the original analysis notes.
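To make that concrete, here is roughly what a declarative GPU request looks like with DRA. This is a sketch, not a drop-in manifest: the field layout follows the beta DRA API, the GA (v1) schema in 1.34 may differ, and the device class, capacity names, and CEL expression are assumptions about what your GPU driver actually publishes.

```yaml
# Sketch only: field layout follows the beta DRA API (resource.k8s.io/v1beta1);
# check the GA (v1) schema in your cluster, which may differ. Device class and
# capacity names are assumptions about what the GPU driver publishes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: llm-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          # Illustrative CEL filter: only devices with at least 40Gi of memory.
          expression: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
---
# The pod references the claim instead of asking for an opaque "nvidia.com/gpu: 1".
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: llm-gpu
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    resources:
      claims:
      - name: gpu
```

The point is the shape: the pod asks for a claim, not a count, and the scheduler finally has enough information about topology and device attributes to place it somewhere sensible.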
Then there’s the Inference Gateway—which also just hit GA. This is where things get interesting for multi-model serving. Instead of running separate inference servers for different models, platform teams can now route traffic based on model names, LoRA adapters, and endpoint health using Kubernetes-native APIs. That means more utilization. Fewer accelerators sitting idle. Lower costs.
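Here is a rough sketch of what that routing looks like. Field names follow the Inference Extension’s alpha API and may have moved at GA, and the pool selector, port, model name, and LoRA target are all placeholders, so read it for shape rather than copy-paste.

```yaml
# Sketch only: field names follow the Inference Extension's alpha API and are
# illustrative. Pool selector, port, and model names are placeholders.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool
spec:
  selector:
    app: vllm-llama          # labels on the model-server pods
  targetPortNumber: 8000
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: support-bot
spec:
  modelName: support-bot     # the model name clients send in the request body
  criticality: Standard
  poolRef:
    name: llama-pool
  targetModels:              # route to a LoRA adapter served by the same pool
  - name: support-bot-lora-v2
    weight: 100
```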
The newly formed WG AI Gateway is already pushing further: token-based rate limiting, semantic routing, payload processing for prompt filtering. These aren’t academic projects. They’re solving real production problems that enterprises are hitting right now at scale.
Why This Isn’t Hype (And Why It Matters)
Look at the observability layer. Most AI platforms are blind to what matters: tokens per second, time to first token, queue depth, cache hit rates. OpenTelemetry and Prometheus are getting instrumentation for these metrics, and inference-perf benchmarking tools are standardizing measurement across model servers.
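A minimal version of that instrumentation, assuming a Prometheus Operator setup and a model server like vLLM that already exposes a /metrics endpoint (the label selector and port name below are placeholders, and metric names vary by server version):

```yaml
# Sketch only: scrape a vLLM server's Prometheus endpoint via the Prometheus
# Operator. Label selector and port name are placeholders; metric names such
# as vllm:time_to_first_token_seconds depend on the vLLM version you run.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm-llama
  endpoints:
  - port: http               # named port on the model server's Service
    path: /metrics
    interval: 15s
```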
For workflows, Kubeflow has grown into a top-30 CNCF project with hundreds of active contributors. Kueue handles job queuing for batch and training. OPA and SPIFFE/SPIRE provide governance—controlling which teams can access which models, establishing workload identity across inference services.
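For the queuing piece, a hedged sketch of how a fine-tuning Job gets submitted through Kueue. It assumes a ClusterQueue with GPU quota already exists; the queue names and trainer image are placeholders.

```yaml
# Sketch only: queue a fine-tuning Job through Kueue. Assumes a ClusterQueue
# named "gpu-cluster-queue" already exists with nvidia.com/gpu quota configured.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml
  namespace: ml-team
spec:
  clusterQueue: gpu-cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llama
  namespace: ml-team
  labels:
    kueue.x-k8s.io/queue-name: team-ml   # Kueue admits the Job when quota is free
spec:
  suspend: true                          # Kueue flips this to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: ghcr.io/example/finetune:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 4
```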
And here’s the part that actually matters: this all works together. GitOps patterns from Argo and Flux apply to model serving. The same declarative, version-controlled deployment philosophy that works for web services matters even more when a bad model version can produce incorrect outputs.
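In practice, that means pointing the same Argo CD Application you would use for a web service at your model-serving manifests. The repo URL, path, and namespaces below are placeholders.

```yaml
# Sketch only: the same GitOps loop used for web services, pointed at
# model-serving manifests. Repo URL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform   # placeholder repo
    targetRevision: main
    path: serving/llama          # model version and config pinned in Git
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Rolling back a bad model version becomes a Git revert, which is exactly the property you want when incorrect outputs are the failure mode.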
The stack is coherent. It’s not perfect, but it’s real.
Where This Actually Breaks Down
Before we declare victory: 82% of container users run Kubernetes in production (the survey’s number), but adoption for AI-specific workflows still lags behind general container usage. Integration is messy. Most AI teams don’t think in Kubernetes abstractions. Most Kubernetes teams don’t understand inference serving patterns.
That gap won’t close overnight. The CNCF’s recent work on standardization helps, but it’s still early. Managed services (AWS SageMaker, Anthropic’s API, Replicate) will keep winning for teams that don’t want to operate infrastructure. They should.
But for enterprises building proprietary models at scale, for teams with governance requirements, for anyone who needs to control cost-per-inference across a GPU fleet—Kubernetes is no longer the weird choice. It’s the platform.
What This Means for Your Team
If you’re an AI engineer moving to Kubernetes: start with the inference serving stack. Deploy a model server (vLLM, TensorFlow Serving) behind the Inference Gateway. Use DRA to manage GPU resources declaratively. Instrument with OpenTelemetry from day one. The patterns will feel familiar if you’ve worked with request-response services at scale.
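As a starting point, a minimal vLLM Deployment might look like the sketch below. The model name, replica count, and plain GPU limit are placeholders; on a DRA-enabled cluster you would swap the GPU limit for a ResourceClaim as shown earlier.

```yaml
# Sketch only: a minimal vLLM serving Deployment. Model name, replicas, and the
# plain GPU limit are placeholders; adapt to your cluster's GPU setup.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]   # placeholder model
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
```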
If you’re a platform engineer supporting AI teams: stop treating inference services like traditional stateless workloads. They need autoscaling based on token throughput, not CPU. Training jobs are long-running and span multiple nodes with specialized interconnects. The infrastructure-as-code patterns work the same way; the parameters change.
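Concretely, that can be as simple as pointing a HorizontalPodAutoscaler at a per-pod throughput metric instead of CPU. This assumes a custom-metrics adapter (prometheus-adapter or similar) is in place, and the metric name below is hypothetical and depends on your adapter’s naming rules.

```yaml
# Sketch only: autoscale on token throughput rather than CPU. Assumes a custom
# metrics adapter exposes a per-pod metric; the metric name is hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-llama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_generation_tokens_per_second
      target:
        type: AverageValue
        averageValue: "4000"
```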
The hard part isn’t technical. It’s organizational—getting the two communities to stop talking past each other.
Frequently Asked Questions
Can you run AI models on Kubernetes in production?
Yes. 82% of container users already run Kubernetes in production, and the ecosystem now has purpose-built tools for inference serving (Gateway API Inference Extension), GPU scheduling (Dynamic Resource Allocation), and observability. It’s production-ready as of 2025.
What’s the difference between Kubernetes and managed AI services?
Managed services (AWS SageMaker, Replicate) handle infrastructure for you. Kubernetes gives you control over cost, governance, and model versions—but you operate it. Choose based on whether you need that control or just want models to work.
Do I need to learn cloud native stuff to run AI models?
Depends on your role. AI engineers can start with inference serving patterns without deep Kubernetes knowledge. Platform engineers need to understand new workload types (GPU scheduling, token-based autoscaling). The gap is narrowing, but it still exists.