GPUs Without the Headache: Scaling AI Infrastructure for Engineering Teams
AI adoption is accelerating, but the real bottleneck isn’t models—it’s infrastructure. GPUs are scarce, expensive, and often underutilized, leaving engineering leaders balancing innovation with spiraling costs. The challenge: how do you give teams the GPU power they need without creating an infrastructure management nightmare?
In this talk, we’ll explore how Kubernetes has evolved into the control plane not just for microservices, but for AI workloads. We’ll dive into how dynamic autoscaling with Karpenter enables flexible, on-demand allocation of GPUs across multiple tenants, ensuring resources are available when workloads spike—and released when they’re not. The result: engineering teams get the performance they need, while platform teams keep budgets and complexity under control.
Whether you’re leading platform engineering, data infrastructure, or AI initiatives, this session will provide insights from real-world scenarios and valuable guidance on how to best deliver GPUs without the headache—so your teams can focus on building, not curating bespoke infrastructure.
Full transcript
Lukas Gentele
Hey everybody. I'm Lukas, and I'm here to talk about GPUs without the headache: how to scale your AI infrastructure for engineering teams. Thanks for joining this talk.
I want to start with something I've heard a bunch of times over the past couple of years: the common myth that only frontier AI labs need GPUs, companies like OpenAI, Microsoft, Google, and Anthropic. Do I even need GPUs in my enterprise?
You almost certainly won't need 10,000-plus GPUs like OpenAI runs to train a frontier LLM, but you may well need hundreds. Most enterprises will run GPUs in some shape or form, whether that is in the public or private cloud, on bare metal, or in VMs. It is just not going to be at the scale of a frontier lab.
You are probably going to need them for fine-tuning models on proprietary data. You might need them to train highly specialized models: not a large LLM, but a smaller language model. You may want to run inference in production. You may want to generate embeddings for doing RAG in your applications and offering that as a service to your customers. You may want to run domain-specific ML workloads; in industries like pharma, for example, you may want to do drug discovery. Highly specialized workloads can be vastly accelerated by GPUs.
Last but not least, and very important for engineering leaders, you want to attract and retain top talent. Top talent wants to operate at the cutting edge. What is more cutting edge than the newest GPU out there, a new generation of compute? People want to experiment with these technologies. If you do not give them access, you may not get the top talent. It is like when the container age started and a company said, "No containers, no Kubernetes, we are not doing that," or when AWS became popular and two or three years in your company said, "Cloud? No, we are not doing that." You are probably losing out on talent if you do that.
But the classical model of handing out infrastructure as we have done it with CPUs does not translate well to the GPU era. Generally, an engineer needs compute and you give them a static amount of CPU. That is the traditional model. The straight arrow on the slide is probably your enterprise process to get approvals and whatnot to actually get that compute. At some point, an engineer gets some amount of CPU. Say it is an EC2 instance, a virtual machine in the public cloud. It may cost $1,600 per year. I looked at current prices; with an enterprise agreement you may get it a little cheaper from Amazon. Four CPUs for an engineer sounds great.
Maybe you do not hand out EC2 instances in a broad, archaic way anymore as you did 10 years ago before containers, but this model still works pretty well today. The price is affordable. If you look at an engineering salary, it is a drop in the bucket to say somebody needs a MacBook and an EC2 instance, and we are going to finance that.
With GPUs, if you want to give out a static amount of GPU running 24/7, 365 days a year, for a single engineer, current AWS prices get you to about $60K for an individual engineer having one GPU attached to an EC2 instance. That is quite a difference. The classical model really breaks, most obviously in terms of your budget.
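To make the gap concrete, here is a minimal Python sketch of the arithmetic. The two annual figures are the ones quoted in the talk; the 2,080-hour working year used for the on-demand comparison is an illustrative assumption, not a number from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours for an always-on instance


def hourly_rate(annual_cost: float) -> float:
    """Convert an always-on annual cost into an implied hourly rate."""
    return annual_cost / HOURS_PER_YEAR


cpu_annual = 1_600.0   # 4-CPU instance, figure quoted in the talk
gpu_annual = 60_000.0  # single-GPU instance, figure quoted in the talk

cpu_hourly = hourly_rate(cpu_annual)
gpu_hourly = hourly_rate(gpu_annual)
ratio = gpu_annual / cpu_annual

# Assumption: an engineer only needs the GPU during a 2,080-hour working year.
working_hours = 2_080
on_demand_annual = gpu_hourly * working_hours

print(f"CPU instance: ${cpu_hourly:.2f}/hour")               # $0.18/hour
print(f"GPU instance: ${gpu_hourly:.2f}/hour")               # $6.85/hour
print(f"GPU is {ratio:.1f}x the CPU instance")               # 37.5x
print(f"GPU billed only when used: ${on_demand_annual:,.0f}/year")  # $14,247/year
```

Under these assumptions, releasing the GPU outside working hours cuts the bill to roughly a quarter of the always-on cost, which is the motivation for the dynamic allocation discussed later in the talk.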
So take a step back and look at the evolution of compute delivery in most enterprises. Going very far back, there were physical servers being plugged into racks. Maybe not an individual engineer, but a team got its own set of servers. We are talking about months to provision and very rigid systems, because it is hardware.
Then cloud VMs changed the game. Things are still static in terms of compute amounts, but you can provision them much faster. You provision them in days, given your enterprise approval process. Maybe it takes a week or two instead of a day, but technically you could hand out an EC2 instance within an hour after somebody needs it. Maybe you even have a self-service system with automated approvals for certain things, if you are that sophisticated and you trust your engineers to stay within budget.
Then containers changed the game again. We are talking about automatic scheduling, fully dynamic compute, and orchestration of compute in real time across a set of machines. That has its own challenges: how do you budget for this when somebody launches three containers? CPU and memory are relatively flexible depending on what they are deploying. FinOps, chargebacks, and all those topics come to mind. But that has been the natural evolution over the past 20 or 30 years.
When you talk about GPUs and handing them out to your teams, there are key challenges. The first starts with sourcing and provisioning. You cannot just say, "I want 10,000 GPUs on Amazon tomorrow." You are going to struggle to provision them there. You might look at traditional hyperscalers, neo-clouds, or potentially building some on-prem capacity, building an AI factory in your own data centers. Maybe you need a combination of all of the above given limited supply in the market. You also may not want to be tied to one approach or vendor because of pricing pressure. You want the ability to negotiate prices and shift workloads depending on where you get the cheapest compute.
CPUs and GPUs are different in lifecycle, too. If I put a CPU in my data center, that may be good for five or seven years. You can probably do a lot of things with it. With GPUs, the lifecycles are much quicker. We need to retain flexibility.
The second challenge is allocation and sharing of resources. You want to provision quickly, reduce wait times, and ensure fair use because you are working with a constrained amount of resources. When you look at the number of CPUs in your company today, it is probably a large number. Are you going to have as many GPUs? Probably not. The number will be much smaller. So we have to enable efficient sharing, prevent idle compute, and dynamically reassign things, maybe on an hourly or daily basis. It is challenging.
Last but not least, you need to control cost. In a dynamic environment, you want to set budgets, track cost, and create cost awareness within teams. That is why I showed the difference between a thousand-something dollars for an EC2 instance with CPUs versus $60K for GPUs. I am not sure every engineer in your company is aware of that. There is education to be done, and it helps drive accountability through showback so people understand how much cost they are generating, especially in GPU land.
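A showback report can be as simple as multiplying metered GPU-hours by a rate and grouping by team. The sketch below assumes a hypothetical list of `(team, gpu_hours)` usage records and a flat hourly rate; a real setup would pull these numbers from a metering tool rather than hard-code them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def showback(records: Iterable[Tuple[str, float]], hourly_rate: float) -> Dict[str, float]:
    """Sum GPU cost per team from (team, gpu_hours) usage records."""
    totals: Dict[str, float] = defaultdict(float)
    for team, gpu_hours in records:
        totals[team] += gpu_hours * hourly_rate
    return dict(totals)


# Hypothetical usage records for one week; team names and hours are made up.
usage = [
    ("search", 120.0),
    ("fraud", 40.0),
    ("search", 30.0),
]

# Assumption: a flat rate of $7 per GPU-hour, chosen only for illustration.
report = showback(usage, hourly_rate=7.0)
for team, cost in sorted(report.items()):
    print(f"{team}: ${cost:,.2f}")
```

The point of the report is visibility rather than billing: each team sees what its GPU usage costs, even if nobody is charged for it.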
If we look at the entire stack needed to get to very dynamic and flexible GPU infrastructure in an enterprise, we have to consider the things we just talked about. Where do we source GPUs from? That is the data center and infrastructure layer. We want portability across environments. Maybe we do cloud bursting, where we use our private cloud because it can potentially be operated much cheaper than what hyperscalers offer today, but then cloud burst for peak capacity. You cannot just call NVIDIA and say, "I need 200 more GPUs tomorrow." It will take a while until that delivery arrives.
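The cloud bursting idea can be sketched as a placement rule: fill private capacity first and send only the overflow to the public cloud. This is a deliberately simplified illustration with hypothetical job names and GPU counts, not a description of any particular scheduler.

```python
from typing import Dict


def place_jobs(gpu_requests: Dict[str, int], private_free: int) -> Dict[str, str]:
    """Place each job on private capacity if it fits, otherwise burst to cloud."""
    placements: Dict[str, str] = {}
    for job, gpus in gpu_requests.items():
        if gpus <= private_free:
            placements[job] = "private"
            private_free -= gpus  # reserve the GPUs in the private pool
        else:
            placements[job] = "cloud"  # overflow goes to the public cloud
    return placements


# Hypothetical requests against a private pool with 10 free GPUs.
requests = {"train": 8, "finetune": 4, "embed": 2}
print(place_jobs(requests, private_free=10))
```

With these inputs, `train` and `embed` land on the private pool and `finetune` bursts to the cloud, because only 2 private GPUs remain after `train` is placed.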
The second layer is compute, where we track consumption, enforce quotas, and autoscale resources so people allocate just as much as they need to get their work done, but not so much that we deal with idle capacity. I talk to a lot of enterprises running Kubernetes at large scale, and sometimes I ask what their average utilization rate is in their private cloud data center. I hear things like 20%. That is not going to fly with GPUs. If you made a GPU investment in the private cloud and somebody runs at 20% capacity, that CIO or CTO who ultimately owns the budget is not going to be happy.
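Here is a rough Python sketch of why 20% utilization hurts so much at GPU prices. It reuses the $60K per GPU-year figure from earlier in the talk; the fleet size of 100 GPUs is an assumption for illustration.

```python
GPU_ANNUAL_COST = 60_000.0  # per-GPU figure quoted earlier in the talk


def idle_spend(num_gpus: int, utilization: float,
               annual_cost: float = GPU_ANNUAL_COST) -> float:
    """Annual spend on GPU capacity that sits idle."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return num_gpus * annual_cost * (1.0 - utilization)


def cost_per_useful_gpu_year(utilization: float,
                             annual_cost: float = GPU_ANNUAL_COST) -> float:
    """Effective price of one fully used GPU-year at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return annual_cost / utilization


# Assumption: a fleet of 100 GPUs running at the 20% utilization mentioned above.
print(f"Idle spend: ${idle_spend(100, 0.20):,.0f}/year")
print(f"Effective cost per useful GPU-year: ${cost_per_useful_gpu_year(0.20):,.0f}")
```

At 20% utilization, every GPU-year of real work effectively costs five times the list price, which is why quotas and autoscaling matter more here than they did for CPUs.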
At runtime, we want to enable engineers to run compute on GPUs very quickly. There is a whole software stack necessary to run on top of GPUs: GPU Operator, the NVIDIA Container Toolkit, and auxiliary tools. We want to abstract GPU handling away from engineers so they can program and run things on GPUs rather than maintain and struggle with the infrastructure.
Looking at the history I showed earlier, I am actually seeing people go back and hand out individual machines: a DGX server, for example, or individual EC2 instances with two or three GPUs attached. In many cases with GPUs, we are taking a step back from containers, which is not ideal. I am making the case today to stay in the container age, to keep the models that work for us, but adjust them to the GPU world. Last, that also means self-service provisioning, ensuring tenant autonomy, and giving teams flexibility on top of the GPU infrastructure.
I believe Kubernetes is still the best tool to operate infrastructure, whether at very large scale or a decently small scale, because it abstracts away from the underlying infrastructure and gives portability between environments. Tools like Prometheus and OpenCost track utilization and allow chargebacks, and Kubernetes has invested a lot in becoming better suited to run GPUs. I argue that Kubernetes should remain the API for GPUs, but it needs to adjust to the realities of GPUs. I will show how some companies we work with use open source tools in the Kubernetes space to achieve this.
First, Kubernetes is used for GPUs by a lot of industry leaders. When you look at the tech blogs of frontier labs like OpenAI, they write a lot about how they use Kubernetes. Kubernetes is proven at pretty large scale for dev and experimentation. Obviously, if you are a frontier AI lab, you may have custom HPC schedulers on top, maybe running Slurm on top of Kubernetes. But Kubernetes is a great foundation for people to work with GPUs, iterate over GPUs, and ultimately get to model training.
Most mid-size AI companies and internal teams I see are already looking at Kubernetes today, especially if they operate in private cloud. In public cloud, it is easy to create an EC2 instance and attach GPUs. In private cloud, you have to figure out what to do. Constraints limit how infrastructure can be deployed; for example, running VMs with GPUs is not as easy as it is with CPUs.
NVIDIA also strongly supports Kubernetes. Look at the NVIDIA Container Toolkit, or the acquisition of Run:ai, which they integrated deeply into what they call the NVIDIA Enterprise AI stack. That shows that one of the core hardware vendors in the space, with dominant market share, is investing heavily in Kubernetes.
Kubernetes is also an open standard with a vast ecosystem. Kubeflow, Volcano, Ray, and other tools are built on top of Kubernetes. There are many HPC solutions for storage and networking available for Kubernetes, such as Weka and VAST. You will need more than pure compute. And you may have Kubernetes expertise in your company. Today it is table stakes: every enterprise is going to have a Kubernetes team running Kubernetes in some shape or form. Kubernetes is pretty much the de facto standard for compute.
The traditional single-cloud managed Kubernetes model looks like this: a few namespaces, which are a very soft isolation layer, and nodes from the same cloud provider at the bottom. This may be EKS with three EC2 nodes. You can have GPUs in those instances, create a Kubernetes cluster, and give tenants access to namespace A and namespace B. That is how you logically keep them separate.
But the compute layer at the bottom and the Kubernetes layer at the top are not well isolated. Tenants run on the same Kubernetes cluster and the same nodes. There are challenges.
We started the vCluster open source project a couple of years ago, and it has become a popular GitHub project with about 10,000 stars and 40 million-plus image pulls. It solves the first problem: how do I hand out more Kubernetes clusters and give tenants autonomy, such as their own Kubernetes version, without lockstep upgrades for multiple teams? How do I enable them to run their own CRDs, group things themselves, and have multiple namespaces? It is a great tool to solve that logical problem at the Kubernetes API layer without actually handing out more physical Kubernetes clusters.
You could say, "Let's create 300 EKS clusters," but your CFO may not thank you. Sharing clusters, compute, and infrastructure to some extent is very beneficial, even with CPUs. vCluster lets you do that while making it feel like everybody has their own cluster. It is like virtualizing Kubernetes, similar to what VMs did to servers, enabling cloud VMs and a lot of velocity in companies. We do the same with the vCluster open source project, but for Kubernetes.
But there are two layers to the diagram: the logical layer, which is the Kubernetes API and higher-level resources, and the actual containers, network, storage, and compute nodes at the bottom. Those are still shared. Multiple tenants still run on the same node. That is potentially problematic for certain workloads, especially GPUs, where you may want to own the entire node. Noisy-neighbor problems are much worse in GPU environments, where it may not even be possible to share a GPU node.
So we changed the approach. We have been heavily investing in solving the bottom-layer problem while keeping the model of one EKS cluster with multiple vClusters on top. A vCluster control plane is a Kubernetes control plane that runs in a pod inside a namespace of an EKS, OpenShift, or Rancher cluster. It can run on any cluster; it is just a pod with a Kubernetes control plane.
What if we took a bunch of nodes and used kubeadm to join them directly into this control plane? We would get the benefit of not having completely separate clusters. We still have the shared logical layer and vCluster as a lightweight abstraction on top, but now we can get dedicated nodes for dedicated teams.
On top of this, we added Karpenter. Karpenter is an open source project AWS started a couple of years ago. At re:Invent last year, AWS announced EKS Auto Mode, which is based on Karpenter. Karpenter lets you create a Kubernetes cluster with zero nodes. When you launch a pod in that cluster, Karpenter sees the available node types, picks the best one for the workload, and joins that node into your Kubernetes cluster. As you deploy more workloads, you may have nodes of different types. When workloads go away, Karpenter may decide two underutilized nodes can be replaced by a slightly bigger node and consolidate them. That gives you dynamic autoscaling.
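The two behaviors described here, picking a node type for pending work and consolidating underutilized nodes, can be illustrated with a toy model. This is not Karpenter's actual algorithm or API; the node catalog, prices, and function names below are all hypothetical, and real scheduling also weighs CPU, memory, zones, and disruption budgets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeType:
    name: str
    gpus: int
    hourly_cost: float


# Hypothetical catalog of available node types and prices.
CATALOG = [
    NodeType("gpu-small", 1, 4.0),
    NodeType("gpu-medium", 4, 14.0),
    NodeType("gpu-large", 8, 26.0),
]


def cheapest_fit(gpus_needed: int,
                 catalog: List[NodeType] = CATALOG) -> Optional[NodeType]:
    """Return the cheapest node type with enough GPUs, or None if none fits."""
    candidates = [t for t in catalog if t.gpus >= gpus_needed]
    return min(candidates, key=lambda t: t.hourly_cost, default=None)


def consolidate(nodes: List[NodeType], gpus_in_use: int,
                catalog: List[NodeType] = CATALOG) -> List[NodeType]:
    """Replace the current nodes with one cheaper node if it can hold the load."""
    replacement = cheapest_fit(gpus_in_use, catalog)
    current_cost = sum(n.hourly_cost for n in nodes)
    if replacement is not None and replacement.hourly_cost < current_cost:
        return [replacement]
    return nodes


# A pod asking for 3 GPUs gets the smallest node type that fits.
print(cheapest_fit(3))

# Two medium nodes with only 5 GPUs in use collapse into one large node.
medium = CATALOG[1]
print(consolidate([medium, medium], gpus_in_use=5))
```

In the consolidation example, two medium nodes cost 28.0 per hour combined while one large node costs 26.0 and still holds the 5 GPUs in use, so the sketch swaps them, mirroring the "two underutilized nodes replaced by a slightly bigger node" case.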
Google tried something similar with GKE Autopilot, but that is more of a nodeless, serverless Kubernetes approach. EKS also had Fargate as a backend. These offerings have problems for the use case in this talk: typically you cannot have GPUs in them, and they have strings attached around networking and workload types. The benefit of Karpenter is that you still get a regular cluster with regular nodes; node management, choosing node types, and setting up autoscaling groups are handled by Karpenter rather than by you.
One problem with Karpenter is that you have to implement it for each cloud individually. If you bake Karpenter into EKS, it provisions EC2 instances. That makes sense. But with vCluster, you could technically have a Metal as a Service bare-metal node from your private cloud, join in an Azure VM, and have a separate vCluster that runs an EC2 instance. Baking in Karpenter across that model is more challenging than implementing it for one cloud.
We still baked it in. Karpenter creates node claims. You tell it these nodes, environments, and node types are available, and it creates node claims. We have a component called vCluster Platform that is aware of all node pools accessible to it, whether private cloud, public cloud, and so on. It talks to node providers, a concept we introduced on top of Karpenter. Those node providers spin up nodes: PXE boot a bare-metal node in your private data center, or start an EC2 instance, for example. We do the lifecycle management of the node and join it into the vCluster. You get a right-sized, autoscaled Kubernetes cluster.
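Conceptually, the node provider layer is a dispatch step: a node claim names the environment it should come from, and the matching provider does the actual provisioning. The sketch below is a hypothetical illustration of that pattern only. The class, function, and provider names are invented, the provisioning functions are stubs that return strings, and none of this reflects the real vCluster Platform or Karpenter APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class NodeClaim:
    name: str
    provider: str      # which node provider should fulfil the claim
    machine_type: str  # hypothetical machine type label


def provision_bare_metal(claim: NodeClaim) -> str:
    # Stub: a real provider would PXE boot a machine in the private data center.
    return f"PXE-booted {claim.machine_type} as {claim.name}"


def provision_ec2(claim: NodeClaim) -> str:
    # Stub: a real provider would start an EC2 instance.
    return f"started EC2 {claim.machine_type} as {claim.name}"


# Hypothetical registry mapping provider names to provisioning functions.
PROVIDERS: Dict[str, Callable[[NodeClaim], str]] = {
    "bare-metal": provision_bare_metal,
    "aws": provision_ec2,
}


def fulfil(claim: NodeClaim) -> str:
    """Route a node claim to the provider that can create the node."""
    try:
        provider = PROVIDERS[claim.provider]
    except KeyError:
        raise ValueError(f"no node provider registered for {claim.provider!r}")
    return provider(claim)


print(fulfil(NodeClaim("node-1", "bare-metal", "gpu-8x")))
print(fulfil(NodeClaim("node-2", "aws", "gpu-1x")))
```

Keeping the providers behind a single lookup is what lets one autoscaling loop span private and public environments: adding a new environment means registering one more provider, not changing the loop.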
We built this based on Terraform and OpenTofu because there are many Terraform and OpenTofu providers for every cloud provider, and they are well maintained. It is a solid ecosystem, even for private cloud tools like OpenStack and Metal as a Service from Canonical. They have solid Terraform providers. It works with Terraform Enterprise; you swap out the URL so Terraform Enterprise manages the state file. You can write your own Terraform and OpenTofu providers. Even many neo-clouds, such as Nebius, have well-maintained Terraform providers.
We did special integrations. One is for NVIDIA BCM, Base Command Manager, which is the node manager that runs on NVIDIA BasePOD and SuperPOD systems. That API is generally not Terraform-driven; it is more like RPC calls. We wanted to make it easy for customers running in those environments who want, for example, to run a SuperPOD in a private data center and then add EC2 nodes for cloud bursting.
The second direct integration is into KubeVirt. That is useful because when you think about a GPU cluster, you will never get away with just GPU nodes. You still have to run things like OPA, Prometheus, and OpenCost. There are lots of tools you have to run on CPU nodes. What if we could run them on bare metal in your private cloud, create a KubeVirt cluster, and create right-sized VMs? You almost get a vSphere-type experience where we right-size the cluster on top of this, while also having GPU node pools at the same time.
In vCluster YAML, you specify that it should use private nodes. The standard model I showed earlier is what we call shared nodes, where all nodes are used from the underlying cluster. That is how we started the open source vCluster project. Private nodes is something we launched in August, and Auto Nodes is what we call the Karpenter integration, launched about two weeks ago. It is easy to configure Auto Nodes: enable private nodes, then under Auto Nodes specify the machine types you want from AWS and Azure, and perhaps the machines to join from OpenStack or BCM as well. That lets you create very dynamic, right-sized, and well-isolated clusters for your tenants.
Additionally, we launched another technology earlier this year called vNode. vNode creates what we call virtual nodes on individual compute nodes. Many users struggle with running on bare-metal compute. VMs on GPUs have many issues even with passthroughs. Specialized cloud providers like Nebius do that well because they can invest heavily in their own version of KubeVirt, their own virtualization, and their own flavor of KVM. Most enterprises are not going to do that. If you take KVM out of the box and run it on GPUs, there will be quite a few hiccups.
So how do you still segment workloads on these nodes? We do that with vNode. vNode is effectively a container-native technology. I tell people it is as secure as it gets without creating a VM. They are still containers. There is no separate kernel; we are still running a shared kernel. VMs are more secure in that regard, and with a kernel exploit you can still run into issues. But the most common thing you want to protect against, especially for internal or dev environments, is people causing damage, installing something that leads to container breakout, or putting in malicious software.
For example, there was a container breakout in the NVIDIA Container Toolkit in July that led to a full root exploit. The startup of the container was the problem. It was not a running container being attacked; it was starting the container. There was an issue in the container toolkit startup. Since it needs to attach a GPU, it runs as a root-privileged process at that moment. If you pass a certain environment variable and a small script into your Dockerfile and run that with the NVIDIA Container Toolkit, boom, you are on the node and can run any C code. That is very problematic.
We launched vNode earlier this year, so when Wiz came out with a blog post showing this vulnerability, we were eager to test whether vNode protects against it. It did. We wrote an in-depth blog post about it. I also gave a talk at the AI Infra Summit about two weeks ago on how this attack played out and how vNode protects against it. It is a great tool when you want to prevent container breakouts in a bare-metal environment, which you find a lot in GPU land.
Here is an overview of all the different models: shared nodes with traditional vCluster, virtual nodes with vNode, and then private nodes and Auto Nodes. You have traditional namespaces on the left and traditional separate Kubernetes clusters, such as separate OpenShift or EKS clusters, on the other extreme. What we are trying to do with the vCluster project is offer a spectrum in the middle: efficient sharing, much more efficient than traditional clusters, but with certain degrees of isolation depending on the use case.
You may run multiple variations of this. For example, you may run a production cluster with private nodes and Auto Nodes completely separate, but then a development cluster in multi-tenant mode. In one case there is only one tenant and one node, and there is still a vNode because that additional layer may be useful even if it is a single tenant on that node and a single-tenant production cluster. On the other side, you may run multi-tenant mode with multiple tenants on the same node. Virtual nodes help isolate those tenants from each other and prevent container breakouts or people causing damage, which unfortunately happens as well.
That is pretty much the end of the talk. I know we have a minute left, so if there are any questions, I am happy to answer them. Otherwise, enjoy the rest of the last day of the conference. Thank you.