# Kubernetes Device Plugins for GPUs
Learn how the Kubernetes device plugin framework exposes GPU hardware to containers, install the NVIDIA device plugin, and configure pods to request GPU resources.
## How Device Plugins Work
The Kubernetes Device Plugin Framework allows hardware vendors to advertise specialized resources to the kubelet without modifying Kubernetes core code. For GPUs, NVIDIA provides an official device plugin that:
- Discovers NVIDIA GPUs on each node
- Reports GPU counts to the Kubernetes API as `nvidia.com/gpu` extended resources
- Allocates specific GPU devices to containers at runtime
- Manages device health monitoring and reporting
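Once the plugin is registered, the advertised GPU counts show up under each node's `status.allocatable`. As a minimal sketch, the snippet below tallies advertised GPUs cluster-wide by parsing the JSON that `kubectl get nodes -o json` emits; the embedded sample document is a hypothetical stand-in for real cluster output.

```python
import json

# Hypothetical stand-in for the output of: kubectl get nodes -o json
sample_nodes_json = """
{
  "items": [
    {"metadata": {"name": "gpu-node-01"},
     "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-01"},
     "status": {"allocatable": {"cpu": "16"}}}
  ]
}
"""

def gpu_allocatable(nodes_json: str) -> dict:
    """Return {node name: allocatable GPU count} for nodes advertising GPUs."""
    nodes = json.loads(nodes_json)["items"]
    return {
        n["metadata"]["name"]: int(n["status"]["allocatable"]["nvidia.com/gpu"])
        for n in nodes
        if "nvidia.com/gpu" in n["status"].get("allocatable", {})
    }

print(gpu_allocatable(sample_nodes_json))  # {'gpu-node-01': 4}
```

Nodes without the device plugin (or without GPUs) simply lack the `nvidia.com/gpu` key, which is why the sketch filters on its presence rather than assuming every node reports it.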
## Installing the NVIDIA Device Plugin
The recommended approach is to use the NVIDIA GPU Operator, which handles all dependencies. For manual installation:
```shell
# Deploy the NVIDIA device plugin as a DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

# Verify the plugin is running on GPU nodes
kubectl get pods -n kube-system -l app=nvidia-device-plugin-daemonset

# Check that GPUs are reported as allocatable resources
kubectl describe node gpu-node-01 | grep nvidia.com/gpu
```
Prerequisites: GPU nodes must have NVIDIA drivers installed and the NVIDIA Container Toolkit configured. The GPU Operator can automate this entirely.
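A quick way to confirm the whole stack (driver, container toolkit, device plugin) works end to end is to run `nvidia-smi` in a pod that requests one GPU. This is a sketch; the pod name and CUDA image tag are illustrative, and the tag should be chosen to match your installed driver version.

```yaml
# cuda-smoke-test.yaml -- end-to-end GPU stack check (names/tags illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

After `kubectl apply -f cuda-smoke-test.yaml`, `kubectl logs cuda-smoke-test` should print the familiar `nvidia-smi` device table if everything is wired up correctly.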
## Requesting GPUs in Pods
Once the device plugin is running, pods can request GPUs using the `resources.limits` field:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    command: ["python", "train.py"]
  restartPolicy: Never
```
Key rules for GPU resource requests:
- GPUs can only be specified in `limits`, not `requests` (they are always equal)
- GPUs are integers — you cannot request fractional GPUs (without time-slicing or MIG)
- A container cannot share a GPU with another container (by default)
- Pods requesting GPUs will only be scheduled on nodes with available GPUs
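These rules compose naturally for multi-GPU workloads. The fragment below requests two whole GPUs and adds a toleration, since GPU node pools are often tainted to keep non-GPU workloads off them; the taint key and the `torchrun` invocation are examples, not requirements — match whatever taint (if any) your cluster applies.

```yaml
# Multi-GPU pod (taint key and training command are illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-job
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu     # example taint key for dedicated GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    command: ["torchrun", "--nproc_per_node=2", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2   # whole integers only; 0.5 is rejected by the API
```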
## GPU Operator vs Manual Setup
| Feature | GPU Operator | Manual Setup |
|---|---|---|
| Driver installation | ✓ Automatic | Manual per node |
| Container runtime | ✓ Auto-configured | Manual configuration |
| Device plugin | ✓ Managed | Manual DaemonSet |
| DCGM monitoring | ✓ Included | Separate install |
| GPU Feature Discovery | ✓ Included | Separate install |
| Day-2 operations | ✓ Automated upgrades | Manual upgrades |
Recommendation: Use the NVIDIA GPU Operator for production environments. It simplifies the entire GPU stack lifecycle and supports automated driver upgrades, making operations significantly easier.
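If your nodes already have drivers installed (common on cloud GPU images), the Operator's Helm chart lets you disable the driver component and manage only the rest of the stack. A sketch of such a values override, assuming the standard `gpu-operator` chart from NVIDIA's Helm repository:

```yaml
# values.yaml for the gpu-operator Helm chart (sketch; verify keys against
# the chart version you deploy)
driver:
  enabled: false   # drivers are preinstalled on the nodes
toolkit:
  enabled: true    # let the Operator manage the NVIDIA Container Toolkit
```

Install with `helm repo add nvidia https://helm.ngc.nvidia.com/nvidia` followed by `helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f values.yaml`.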
Lilly Tech Systems