Beginner

Kubernetes Device Plugins for GPUs

Learn how the Kubernetes device plugin framework exposes GPU hardware to containers, install the NVIDIA device plugin, and configure pods to request GPU resources.

How Device Plugins Work

The Kubernetes Device Plugin Framework allows hardware vendors to advertise specialized resources to the kubelet without modifying Kubernetes core code. For GPUs, NVIDIA provides an official device plugin that:

  • Discovers NVIDIA GPUs on each node
  • Reports GPU count to the Kubernetes API as nvidia.com/gpu extended resources
  • Allocates specific GPU devices to containers at runtime
  • Manages device health monitoring and reporting
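Once the plugin has registered with the kubelet, each GPU node advertises its device count as an extended resource. As a quick sanity check (assuming kubectl access to the cluster), you can list the reported GPU capacity per node; note the escaped dots in the resource name:

```shell
# Show each node's advertised nvidia.com/gpu capacity
# (column is empty until the device plugin registers on that node)
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu
```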

Installing the NVIDIA Device Plugin

The recommended approach is to use the NVIDIA GPU Operator, which handles all dependencies. For manual installation:

# Deploy the NVIDIA device plugin as a DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml

# Verify the plugin is running on GPU nodes
kubectl get pods -n kube-system -l app=nvidia-device-plugin-daemonset

# Check that GPUs are reported as allocatable resources
kubectl describe node gpu-node-01 | grep nvidia.com/gpu
💡 Prerequisites: GPU nodes must have NVIDIA drivers installed and the NVIDIA Container Toolkit configured. The GPU Operator can automate this entirely.
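Before deploying the plugin, it is worth verifying those prerequisites directly on each GPU node. A minimal check might look like this (the containerd config path is the common default; adjust for your distro and runtime):

```shell
# On the GPU node: confirm the NVIDIA driver is loaded and sees the hardware
nvidia-smi

# Confirm containerd knows about the NVIDIA runtime
# (added by the NVIDIA Container Toolkit's configure step)
grep -A2 'nvidia' /etc/containerd/config.toml
```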

Requesting GPUs in Pods

Once the device plugin is running, pods can request GPUs using the resources.limits field:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    command: ["python", "train.py"]
  restartPolicy: Never

Key rules for GPU resource requests:

  • GPUs are specified in limits; you may also set requests, but it must equal limits (requests alone is not allowed)
  • GPUs are integers — you cannot request fractional GPUs (without time-slicing or MIG)
  • A container cannot share a GPU with another container (by default)
  • Pods requesting GPUs will only be scheduled on nodes with available GPUs
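To see these rules in action, you can submit the manifest above and confirm the container sees exactly one device (assuming the manifest is saved as gpu-pod.yaml and the pod is still running):

```shell
# Submit the pod manifest shown above
kubectl apply -f gpu-pod.yaml

# Once the pod is Running, confirm exactly one GPU is visible inside it
kubectl exec gpu-training-job -- nvidia-smi -L

# If no node has a free GPU, the pod stays Pending; the scheduler
# records an "Insufficient nvidia.com/gpu" event explaining why
kubectl describe pod gpu-training-job
```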

GPU Operator vs Manual Setup

Feature                  GPU Operator            Manual Setup
-----------------------  ----------------------  ---------------------
Driver installation      ✓ Automatic             Manual per node
Container runtime        ✓ Auto-configured       Manual configuration
Device plugin            ✓ Managed               Manual DaemonSet
DCGM monitoring          ✓ Included              Separate install
GPU Feature Discovery    ✓ Included              Separate install
Day-2 operations         ✓ Automated upgrades    Manual upgrades
Recommendation: Use the NVIDIA GPU Operator for production environments. It simplifies the entire GPU stack lifecycle and supports automated driver upgrades, making operations significantly easier.
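For reference, the GPU Operator is typically installed via its Helm chart; a common install path looks like the following (chart options and defaults may vary by operator version, so consult the release notes for yours):

```shell
# Add NVIDIA's Helm repository and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Install the operator; it deploys the driver, container toolkit,
# device plugin, DCGM exporter, and GPU Feature Discovery for you
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```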