Beginner

Kubernetes Device Plugins for GPUs

Learn how the Kubernetes device plugin framework exposes GPU hardware to containers, install the NVIDIA device plugin, and configure pods to request GPU resources.

How Device Plugins Work

The Kubernetes Device Plugin Framework allows hardware vendors to advertise specialized resources to the kubelet without modifying Kubernetes core code. For GPUs, NVIDIA provides an official device plugin that:

  • Discovers NVIDIA GPUs on each node
  • Reports GPU count to the Kubernetes API as nvidia.com/gpu extended resources
  • Allocates specific GPU devices to containers at runtime
  • Manages device health monitoring and reporting
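Once the plugin has registered with the kubelet, each GPU node advertises its device count as an extended resource. As a quick sanity check (assuming kubectl access to the cluster), you can list the reported GPU capacity per node; note the escaped dots in the resource name:

```shell
# Show each node's advertised nvidia.com/gpu capacity
# (column is empty until the device plugin registers on that node)
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu
```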

Installing the NVIDIA Device Plugin

The recommended approach is to use the NVIDIA GPU Operator, which handles all dependencies. For manual installation:

# Deploy the NVIDIA device plugin as a DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

# Verify the plugin is running on GPU nodes
kubectl get pods -n kube-system -l app=nvidia-device-plugin-daemonset

# Check that GPUs are reported as allocatable resources
kubectl describe node gpu-node-01 | grep nvidia.com/gpu
💡 Prerequisites: GPU nodes must have NVIDIA drivers installed and the NVIDIA Container Toolkit configured. The GPU Operator can automate this entirely.
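Before deploying the plugin, it is worth verifying those prerequisites directly on each GPU node. A minimal check might look like this (the containerd config path is the common default; adjust for your distro and runtime):

```shell
# On the GPU node: confirm the NVIDIA driver is loaded and sees the hardware
nvidia-smi

# Confirm containerd knows about the NVIDIA runtime
# (added by the NVIDIA Container Toolkit's configure step)
grep -A2 'nvidia' /etc/containerd/config.toml
```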

Requesting GPUs in Pods

Once the device plugin is running, pods can request GPUs using the resources.limits field:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    command: ["python", "train.py"]
  restartPolicy: Never

Key rules for GPU resource requests:

  • GPUs are specified in limits; you may also set requests, but it must equal limits (requests alone is not allowed)
  • GPUs are integers — you cannot request fractional GPUs (without time-slicing or MIG)
  • A container cannot share a GPU with another container (by default)
  • Pods requesting GPUs will only be scheduled on nodes with available GPUs
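To see these rules in action, you can submit the manifest above and confirm the container sees exactly one device (assuming the manifest is saved as gpu-pod.yaml and the pod is still running):

```shell
# Submit the pod manifest shown above
kubectl apply -f gpu-pod.yaml

# Once the pod is Running, confirm exactly one GPU is visible inside it
kubectl exec gpu-training-job -- nvidia-smi -L

# If no node has a free GPU, the pod stays Pending; the scheduler
# records an "Insufficient nvidia.com/gpu" event explaining why
kubectl describe pod gpu-training-job
```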

GPU Operator vs Manual Setup

Feature                  GPU Operator            Manual Setup
-----------------------  ----------------------  ---------------------
Driver installation      ✓ Automatic             Manual per node
Container runtime        ✓ Auto-configured       Manual configuration
Device plugin            ✓ Managed               Manual DaemonSet
DCGM monitoring          ✓ Included              Separate install
GPU Feature Discovery    ✓ Included              Separate install
Day-2 operations         ✓ Automated upgrades    Manual upgrades
Recommendation: Use the NVIDIA GPU Operator for production environments. It simplifies the entire GPU stack lifecycle and supports automated driver upgrades, making operations significantly easier.
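For reference, the GPU Operator is typically installed via its Helm chart; a common install path looks like the following (chart options and defaults may vary by operator version, so consult the release notes for yours):

```shell
# Add NVIDIA's Helm repository and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Install the operator; it deploys the driver, container toolkit,
# device plugin, DCGM exporter, and GPU Feature Discovery for you
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```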