Intermediate

GPU Scheduling

How to manage GPU resources in Kubernetes — install NVIDIA device plugins, request GPUs in Pod specs, use node affinity and taints to direct workloads to GPU nodes.

How GPUs Work in Kubernetes

Kubernetes does not natively understand GPUs. It relies on device plugins to discover and advertise GPU resources on nodes. The NVIDIA device plugin is the most common, making NVIDIA GPUs available as schedulable resources.

The GPU Stack

Hardware: NVIDIA GPU installed in the node (A100, V100, T4, etc.)
Driver: NVIDIA GPU driver installed on the host OS
Container runtime: NVIDIA Container Toolkit (the successor to nvidia-docker) enables GPU access inside containers
Device plugin: NVIDIA device plugin DaemonSet runs on each GPU node and registers GPUs with the kubelet
Pod spec: Request GPUs using nvidia.com/gpu in the container's resource limits

Installing the NVIDIA Device Plugin

The device plugin runs as a DaemonSet, ensuring one Pod per GPU node.

# Deploy the NVIDIA device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

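# Verify GPUs are discovered (prints each node's name and GPU capacity)
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.capacity["nvidia.com/gpu"]}'
# Look for a non-null count on GPU nodes, e.g. "gpus": "4"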

💡

Key point: GPUs cannot be shared between Pods by default. If a node has 4 GPUs, at most 4 Pods can use GPUs on that node (1 GPU each). GPU time-slicing and MIG (Multi-Instance GPU) are newer features that allow sharing, but the CKA focuses on the default behavior.

Requesting GPUs in Pod Specs

Request GPUs using the extended resource nvidia.com/gpu in the container's resource section.

# Pod requesting 2 GPUs for training
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "train.py", "--gpus", "2"]
    resources:
      limits:
        nvidia.com/gpu: 2
        memory: "32Gi"
        cpu: "8"

⚠

Important: For GPU resources, you only need to specify limits (not requests). Kubernetes automatically sets the request equal to the limit for extended resources. GPU resources are always whole numbers — you cannot request 0.5 GPUs.
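
To make the defaulting rule concrete, the sketch below writes out the request that Kubernetes would otherwise derive from the limit. The Pod name is hypothetical; the image is the one used in the example above. If you do set both fields, the two values must be equal, and a request without a limit is not accepted.

```yaml
# Hypothetical Pod that states the GPU request explicitly
apiVersion: v1
kind: Pod
metadata:
  name: gpu-explicit-request   # assumed name, for illustration only
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      requests:
        nvidia.com/gpu: 1      # optional; must equal the limit when present
      limits:
        nvidia.com/gpu: 1      # whole numbers only
```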

Node Affinity for GPU Nodes

Node affinity rules ensure Pods are scheduled on nodes with specific characteristics. Use this to direct ML workloads to GPU-equipped nodes.

# Pod with node affinity for GPU nodes
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
            - v100
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
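
A hard requirement can also be combined with a soft preference. The sketch below uses a hypothetical Pod name and an arbitrarily chosen weight of 80. It still requires an A100 or V100 node, but asks the scheduler to favor A100 nodes when both kinds are available.

```yaml
# Hypothetical Pod: require A100 or V100, prefer A100
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-preferred   # assumed name, for illustration only
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
            - v100
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # valid range is 1-100; 80 is an arbitrary choice
        preference:
          matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
```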

Node Labels for GPU Management

# Label GPU nodes with GPU type
kubectl label node gpu-node-1 gpu-type=a100
kubectl label node gpu-node-2 gpu-type=v100
kubectl label node gpu-node-3 gpu-type=t4

# Label by GPU memory
kubectl label node gpu-node-1 gpu-memory=80Gi
kubectl label node gpu-node-2 gpu-memory=32Gi
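
Once nodes carry these labels, a plain nodeSelector is the shortest way to use them when a single exact match is enough. This is a minimal sketch that assumes the gpu-type=a100 label applied above; the Pod name is hypothetical.

```yaml
# Hypothetical Pod pinned to A100 nodes through a nodeSelector
# List nodes with their gpu-type label first:
#   kubectl get nodes -L gpu-type
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-selector   # assumed name, for illustration only
spec:
  nodeSelector:
    gpu-type: a100              # exact key/value match against the node label
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
```

nodeSelector supports only exact key/value matches. Use node affinity when you need operators such as In or a list of acceptable values.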

Taints and Tolerations

Taints keep Pods that lack a matching toleration off expensive GPU nodes. Tolerations allow specific Pods to be scheduled on tainted nodes.

# Taint GPU nodes to repel non-GPU workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Pod with toleration for GPU taint
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-tolerant
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1

💡

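Best practice: Use taints and tolerations together with node affinity. Taints repel Pods that lack a matching toleration, while node affinity steers the desired Pods onto GPU nodes. A toleration alone only permits scheduling on a tainted node; it does not stop the Pod from landing elsewhere. Combining the two keeps GPU nodes reserved for ML workloads and keeps those workloads on GPU nodes.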
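
Put together, a single Pod spec might look like the sketch below. It assumes the nvidia.com/gpu=present:NoSchedule taint and the gpu-type=a100 label from the earlier commands; the Pod name is hypothetical.

```yaml
# Hypothetical Pod combining a toleration with required node affinity
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-combined   # assumed name, for illustration only
spec:
  tolerations:                  # lets the Pod onto the tainted GPU nodes
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  affinity:                     # keeps the Pod off every other node
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - a100
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
```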

Resource Quotas for GPUs

Limit GPU usage per namespace to prevent a single team from monopolizing all GPUs.

# GPU quota for the training namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    # Extended resources accept only the requests. prefix in a quota;
    # a limits.nvidia.com/gpu entry is rejected
    requests.nvidia.com/gpu: "8"
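
To see how the quota is consumed, consider a Pod created in the same namespace. In this sketch (the Pod name is hypothetical) each such Pod counts 2 GPUs against the quota. With a hard limit of 8, a fifth Pod of this size would be rejected at creation time rather than left Pending.

```yaml
# Hypothetical Pod that consumes 2 of the 8 GPUs allowed in ml-training
# Inspect current usage against the quota with:
#   kubectl describe resourcequota gpu-quota -n ml-training
apiVersion: v1
kind: Pod
metadata:
  name: gpu-quota-consumer      # assumed name, for illustration only
  namespace: ml-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 2       # request defaults to 2 and is charged to the quota
```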

Monitoring GPU Utilization

nvidia-smi — Run inside a Pod to check GPU utilization, memory usage, and temperature
DCGM Exporter — NVIDIA Data Center GPU Manager exports GPU metrics to Prometheus
kubectl describe node — Shows allocated vs. allocatable GPU resources per node

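# Check GPU allocation on a node
kubectl describe node gpu-node-1 | grep -A 10 "Allocated resources"
# Look for the nvidia.com/gpu row: it lists the GPUs requested and limited
# (e.g. 2 and 2); compare with the Allocatable section for the node total (e.g. 4)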
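
For a quick end-to-end check of the whole GPU stack, a one-shot Pod can run nvidia-smi and exit. This is a sketch under assumptions. The Pod name is hypothetical, and the toleration is only needed if your GPU nodes carry the nvidia.com/gpu taint. It also assumes the NVIDIA container runtime makes nvidia-smi available inside the container, as it normally does for CUDA-based images.

```yaml
# Hypothetical one-shot Pod that prints nvidia-smi output and exits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # assumed name, for illustration only
spec:
  restartPolicy: Never          # run once; do not restart after completion
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"          # tolerate the taint regardless of its value
    effect: "NoSchedule"
  containers:
  - name: smi
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
# Read the result once the Pod has completed:
#   kubectl logs gpu-smoke-test
```

If the Pod stays Pending, the device plugin or node labels are the first things to check. If it starts but nvidia-smi fails, the driver or container toolkit layer is the more likely culprit.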

Practice Questions

📝

Q1: A training Pod requesting nvidia.com/gpu: 1 is stuck in Pending state. The cluster has GPU nodes with available GPUs. What is the most likely cause?

A) The NVIDIA device plugin DaemonSet is not running on the GPU nodes
B) The Pod is missing a readiness probe
C) The GPU nodes have insufficient CPU
D) The Pod needs a Service to be created first

Answer:

A) The NVIDIA device plugin DaemonSet is not running on the GPU nodes. Without the device plugin, the kubelet does not know about GPU resources, so nvidia.com/gpu is not advertised as an allocatable resource. The scheduler cannot find a node with available GPUs, leaving the Pod in Pending state. Verify with kubectl describe node and check for nvidia.com/gpu in the allocatable resources.

📝

Q2: You want to ensure that only ML training Pods run on expensive GPU nodes and all other workloads are scheduled elsewhere. Which combination of Kubernetes features should you use?

A) Labels and annotations
B) Taints and tolerations
C) ResourceQuotas and LimitRanges
D) NetworkPolicies and Services

Answer:

B) Taints and tolerations. Taint the GPU nodes so that only Pods with matching tolerations can be scheduled there. Add the toleration to your ML training Pod specs. This prevents non-ML workloads (web servers, databases, etc.) from consuming GPU node resources.

📝

Q3: A data scientist requests 0.5 GPUs for a lightweight inference task. What happens when this Pod spec is submitted?

A) The Pod is scheduled and receives half a GPU
B) The Pod is scheduled and receives one full GPU
C) The Pod fails validation because GPU requests must be whole numbers
D) The Pod is scheduled with GPU time-slicing enabled automatically

Answer:

C) The Pod fails validation because GPU requests must be whole numbers. Extended resources like nvidia.com/gpu must be requested in whole numbers. You cannot request fractional GPUs through the standard Kubernetes resource model. GPU sharing requires additional configuration like NVIDIA MIG or GPU time-slicing, which are not standard CKA topics.

📝

Q4: You need to schedule a training job that requires A100 GPUs specifically (not V100 or T4). Nodes are labeled with gpu-type=a100, gpu-type=v100, or gpu-type=t4. Which scheduling feature should you use?

A) PriorityClass
B) Pod topology spread constraints
C) Node affinity with requiredDuringSchedulingIgnoredDuringExecution
D) Pod affinity

Answer:

C) Node affinity with requiredDuringSchedulingIgnoredDuringExecution. Node affinity allows you to constrain which nodes a Pod can be scheduled on based on node labels. Using requiredDuringScheduling makes it a hard requirement — the Pod will only be scheduled on nodes where gpu-type=a100. Pod affinity is for co-locating Pods with other Pods, not for targeting specific node types.

📝

Q5: Which component is responsible for discovering GPU resources on a node and reporting them to the kubelet?

A) kube-scheduler
B) kube-proxy
C) NVIDIA device plugin
D) Container runtime

Answer:

C) NVIDIA device plugin. The device plugin runs as a DaemonSet on GPU nodes. It discovers the GPUs on the node, registers them with the kubelet via the device plugin framework, and makes them available as extended resources (nvidia.com/gpu). The scheduler then uses this information when placing Pods.
