Advanced

GPU Cluster Scheduling

Design multi-tenant GPU clusters for ML training with proper scheduling, queuing, preemption, and cost allocation. Learn the Kubernetes-native tools that production teams use to manage hundreds of GPUs across multiple teams.

The GPU Scheduling Problem

GPUs are the most expensive resource in ML infrastructure. An A100 80GB costs ~$2/hour on cloud, meaning an 8-GPU node costs ~$16/hour. A 64-GPU training job burns $128/hour. Without proper scheduling, teams waste GPUs through idle time, fragmentation, and unfair allocation.
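To make the stakes concrete, here is a back-of-the-envelope waste calculation using the rates above (the 65% utilization figure is a hypothetical cluster average, not a benchmark):

```python
GPU_HOURLY_USD = 2.00   # illustrative A100 on-demand rate from the text
gpus = 64
avg_utilization = 0.65  # hypothetical: fraction of allocated GPU time doing useful work

hourly_burn = gpus * GPU_HOURLY_USD                   # cost while the GPUs are allocated
monthly_burn = hourly_burn * 24 * 30                  # assume they stay allocated all month
monthly_waste = monthly_burn * (1 - avg_utilization)  # cost of idle or underused GPU time

print(f"burn ${monthly_burn:,.0f}/mo, wasted ${monthly_waste:,.0f}/mo")
```

At these numbers the cluster burns $92,160 a month, of which roughly $32,000 buys nothing — which is why scheduling is worth engineering effort.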

The default Kubernetes scheduler handles GPU scheduling poorly because it does not understand gang scheduling (all-or-nothing allocation), fair-share between teams, or job queuing. You need a specialized batch scheduler.

Kubernetes GPU Fundamentals

Before diving into advanced schedulers, understand how Kubernetes handles GPUs at the base level.

# NVIDIA device plugin DaemonSet - required for GPU scheduling
# This runs on every GPU node and exposes GPUs as a Kubernetes resource
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu.present: "true"  # GPU nodes only (label set by GPU feature discovery)

# After deployment, nodes report GPU capacity:
# kubectl describe node gpu-node-1
# Capacity:
#   nvidia.com/gpu: 8
# Allocatable:
#   nvidia.com/gpu: 8
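With the device plugin running, workloads consume GPUs through the nvidia.com/gpu extended resource. A minimal smoke-test pod might look like this (the CUDA image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04
    command: ["nvidia-smi"]  # prints the GPU the pod was granted
    resources:
      limits:
        nvidia.com/gpu: "1"  # extended resources: requests default to limits
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```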
💡
GPU sharing limitation: By default, Kubernetes allocates GPUs as whole units — you cannot request 0.5 GPUs. For GPU sharing (MIG, time-slicing), you need NVIDIA's GPU Operator with MIG support or the time-slicing device plugin. In practice, most training workloads need full GPUs, so this is mainly relevant for development and notebook environments.
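For the time-slicing option mentioned above, the NVIDIA device plugin reads a sharing configuration from a ConfigMap. A sketch of the format (the replica count and ConfigMap name are examples; the plugin must be pointed at this config via its Helm values or --config-file flag):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # each physical GPU is advertised as 4 schedulable GPUs
```

Note that time-slicing provides no memory isolation between the sharing pods, which is why it suits notebooks rather than training jobs.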

Kueue: Kubernetes-Native Job Queuing

Kueue is the official Kubernetes SIG-Scheduling project for batch job queuing. It provides resource quotas, fair-share scheduling, preemption, and priority queues — all natively integrated with Kubernetes.

# Kueue setup: ResourceFlavor, ClusterQueue, and LocalQueue
---
# ResourceFlavor defines GPU types available in the cluster
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100-80gb
spec:
  nodeLabels:
    accelerator: nvidia-a100-80gb
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-80gb
spec:
  nodeLabels:
    accelerator: nvidia-h100-80gb
---
# ClusterQueue defines total available resources and fair-share policy
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training-cluster
spec:
  cohort: ml-org  # Queues in the same cohort can borrow resources
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 64    # 64 A100 GPUs total
        borrowingLimit: 16  # Can borrow up to 16 more from cohort
        lendingLimit: 16    # Can lend up to 16 to cohort
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 4Ti
    - name: h100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 32
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  fairSharing:
    weight: 1
---
# LocalQueue for team-A - maps to the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-training
  namespace: team-a
spec:
  clusterQueue: ml-training-cluster
---
# Submit a training job to the queue
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-finetune-v3
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-training
    kueue.x-k8s.io/priority-class: high-priority
spec:
  parallelism: 2  # 2 pods, each with 8 GPUs = 16 total
  completions: 2
  template:
    spec:
      containers:
      - name: trainer
        image: ml-team/trainer:v3.2
        command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train.py"]
        resources:
          requests:
            cpu: "32"
            memory: "256Gi"
            nvidia.com/gpu: "8"
          limits:
            nvidia.com/gpu: "8"
      restartPolicy: OnFailure
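The kueue.x-k8s.io/priority-class label on the job above refers to a Kueue WorkloadPriorityClass, which must exist in the cluster. A minimal sketch (the value is illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 10000
description: "Admitted ahead of normal workloads; does not affect pod-level preemption"
```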

Volcano: Gang Scheduling for Distributed Training

Volcano is a CNCF project that adds gang scheduling to Kubernetes. Gang scheduling ensures that all pods for a distributed training job are scheduled simultaneously — without it, you get deadlocks where half the pods are scheduled and waiting for the other half, which cannot be scheduled because the first half is consuming all the GPUs.
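To make that failure mode concrete, here is a toy all-or-nothing feasibility check (assumes one replica per node; names and numbers are illustrative):

```python
def gang_schedulable(free_gpus_per_node: list[int], replicas: int, gpus_per_pod: int) -> bool:
    """All-or-nothing check: every replica must fit before any pod starts."""
    nodes_that_fit = [g for g in free_gpus_per_node if g >= gpus_per_pod]
    return len(nodes_that_fit) >= replicas

# Two 8-GPU nodes; each of two jobs needs 2 pods x 8 GPUs.
assert gang_schedulable([8, 8], replicas=2, gpus_per_pod=8)       # job A could run whole
# Without gang scheduling, jobs A and B may each grab one node and then
# hold it forever waiting for a second node that never frees up:
assert not gang_schedulable([0, 8], replicas=2, gpus_per_pod=8)
```

A gang scheduler refuses to place any pod of a job unless all of them fit, so the partial-allocation deadlock above cannot occur.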

# Volcano job for gang-scheduled distributed training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training-v5
  namespace: ml-training
spec:
  schedulerName: volcano
  minAvailable: 2  # Gang scheduling: ALL 2 tasks must be schedulable
  queue: team-a-queue
  policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: RestartTask
    timeout: 5m  # Volcano timeouts are duration strings, not bare seconds
  plugins:
    sla:
    - --sla-waiting-time=30m  # cancel if the gang is not scheduled within 30 min
    svc: []  # headless service + hostfile for pod-to-pod discovery
    env: []  # inject VC_* environment variables into each pod
    ssh: []  # passwordless SSH between pods (needed by some launchers)
  tasks:
  - replicas: 2  # 2 nodes
    name: worker
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - name: trainer
          image: ml-team/trainer:v3.2-cuda12.1
          command:
          - torchrun
          - --nnodes=2
          - --nproc_per_node=8
          - --rdzv_backend=c10d
          - --rdzv_endpoint=$(VC_WORKER_0_HOST):29400
          - train_fsdp.py
          ports:
          - containerPort: 29400
            name: rdzv
          resources:
            requests:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
            limits:
              nvidia.com/gpu: "8"
        restartPolicy: OnFailure
        nodeSelector:
          accelerator: nvidia-a100-80gb

Fair-Share Scheduling and Preemption

In a multi-tenant cluster, you need policies that ensure fair access while maximizing GPU utilization. Here is a practical policy framework:

| Policy | When to Apply | Example |
| --- | --- | --- |
| Guaranteed Quota | Each team gets a minimum allocation they can always use | Team A: 16 GPUs guaranteed, Team B: 8 GPUs guaranteed |
| Borrowing | Teams can use idle GPUs beyond their quota | Team A uses 24 GPUs (16 guaranteed + 8 borrowed from idle Team B quota) |
| Preemption | Reclaim borrowed GPUs when the owner needs them | Team B submits a job → Team A's borrowed pods are preempted (checkpointed and killed) |
| Priority Classes | Critical production retraining gets priority over experiments | Priority: production-retrain > experiment > development |
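The guaranteed-quota and borrowing rows amount to max-min fair sharing: small demands are fully satisfied, and leftover capacity is split equally among the rest. A toy water-filling sketch (team names and demands are hypothetical; real schedulers like Kueue also weight by fairSharing.weight):

```python
def max_min_fair(demands: dict[str, float], capacity: float) -> dict[str, float]:
    """Water-filling: fully satisfy small demands, split what's left equally."""
    alloc = {team: 0.0 for team in demands}
    remaining = dict(demands)          # teams whose demand is not yet met
    free = capacity
    while remaining and free > 1e-9:
        share = free / len(remaining)  # equal split among unsatisfied teams
        satisfied = {t: d for t, d in remaining.items() if d <= share}
        if not satisfied:
            for t in remaining:        # nobody fits under the share: cap everyone
                alloc[t] = share
            break
        for t, d in satisfied.items():
            alloc[t] = d               # small demands are fully met
            free -= d
            del remaining[t]
    return alloc

print(max_min_fair({"team-a": 40, "team-b": 10, "team-c": 30}, capacity=64))
```

With 64 GPUs and demands of 40/10/30, team-b gets its full 10 and teams a and c split the remaining 54 at 27 each.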
# Priority classes for ML workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-retrain
value: 1000000
globalDefault: false
description: "Production model retraining - highest priority, preempts experiments"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment-high
value: 100000
description: "High-priority experiments - preempts dev workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment-normal
value: 10000
description: "Normal experiments - no preemption"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 1000
preemptionPolicy: Never  # Dev workloads are never preempted
description: "Development and notebooks - lowest priority, uses spare capacity"
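A workload opts into a class via priorityClassName in its pod template. A hypothetical low-stakes experiment job:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ablation-sweep-lr  # illustrative job name
spec:
  template:
    spec:
      priorityClassName: experiment-high  # may preempt development pods to start
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: ml-team/trainer:v3.2
        resources:
          limits:
            nvidia.com/gpu: "1"
```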

Cost Allocation

Tracking GPU costs per team, project, and job is essential for budget management and optimization. Here is a production cost allocation system.

# GPU cost tracking with Prometheus and custom metrics
# Export per-job GPU cost as a Prometheus metric

# (get_job_pods and get_completed_jobs are assumed helpers that query the
#  Kubernetes API, e.g. via the official kubernetes Python client)

from prometheus_client import Gauge

# Set from calculate_job_cost results by an exporter loop (not shown)
gpu_cost_gauge = Gauge(
    "training_job_cost_usd",
    "Cumulative GPU cost in USD for a training job",
    ["job_name", "team", "gpu_type", "namespace"],
)

# GPU pricing (on-demand, per GPU-hour)
GPU_PRICING = {
    "nvidia-a100-80gb": 2.21,   # AWS p4d.24xlarge / 8
    "nvidia-a100-40gb": 1.65,   # GCP a2-highgpu-1g
    "nvidia-h100-80gb": 3.75,   # AWS p5.48xlarge / 8
    "nvidia-l4":        0.81,   # GCP g2-standard-4
}

def calculate_job_cost(job_name: str, namespace: str, epochs_completed: int = 1) -> dict:
    """Calculate GPU cost for a running/completed Kubernetes job."""
    pods = get_job_pods(job_name, namespace)  # pod runtimes and GPU allocations
    total_cost = 0.0
    total_gpu_hours = 0.0

    for pod in pods:
        gpu_count = int(pod["resources"]["nvidia.com/gpu"])  # k8s quantities are strings
        gpu_type = pod["node_labels"]["accelerator"]
        runtime_hours = pod["runtime_seconds"] / 3600

        gpu_hours = gpu_count * runtime_hours
        cost = gpu_hours * GPU_PRICING.get(gpu_type, 2.00)  # fallback rate for unknown GPUs

        total_gpu_hours += gpu_hours
        total_cost += cost

    return {
        "job_name": job_name,
        "total_gpu_hours": round(total_gpu_hours, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_epoch": round(total_cost / max(epochs_completed, 1), 2),
    }

# Monthly cost report per team
def generate_monthly_report(namespace: str, month: str) -> dict:
    jobs = get_completed_jobs(namespace, month)
    return {
        "team": namespace,
        "month": month,
        "total_gpu_hours": sum(j["gpu_hours"] for j in jobs),
        "total_cost_usd": sum(j["cost_usd"] for j in jobs),
        "jobs_count": len(jobs),
        "top_5_costly_jobs": sorted(jobs, key=lambda j: -j["cost_usd"])[:5],
        "gpu_utilization_avg": sum(j["gpu_util"] for j in jobs) / max(len(jobs), 1),
    }
💡
Spot/preemptible instances: For experiment workloads (not production retraining), use spot instances to cut costs by 60-70%. Combine with checkpointing every 15-30 minutes so you can resume when a spot instance is reclaimed. Kueue and Volcano both support spot-aware scheduling.

What Is Next

With GPU scheduling and cost allocation in place, the next lesson covers CI/CD for ML. You will learn how to automate retraining, validate models in CI, and deploy with confidence using GitHub Actions and Argo Workflows.