GPU Cluster Scheduling
Design multi-tenant GPU clusters for ML training with proper scheduling, queuing, preemption, and cost allocation. Learn the Kubernetes-native tools that production teams use to manage hundreds of GPUs across multiple teams.
The GPU Scheduling Problem
GPUs are the most expensive resource in ML infrastructure. An A100 80GB costs ~$2/hour on cloud, meaning an 8-GPU node costs ~$16/hour. A 64-GPU training job burns $128/hour. Without proper scheduling, teams waste GPUs through idle time, fragmentation, and unfair allocation.
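Fragmentation is easy to underestimate because cluster-wide totals hide per-node limits. A minimal sketch with hypothetical numbers: plenty of GPUs free in aggregate, yet no single node can host a 4-GPU pod:

```python
# Fragmentation sketch (hypothetical numbers): free GPUs per node after
# existing allocations. The cluster total looks healthy, but a pod's GPUs
# must all come from one node, so placement still fails.
free_gpus_per_node = {"node-1": 2, "node-2": 2, "node-3": 2}

total_free = sum(free_gpus_per_node.values())
pod_request = 4  # one pod asking for 4 GPUs on a single node

fits_somewhere = any(free >= pod_request for free in free_gpus_per_node.values())

print(f"{total_free} GPUs free cluster-wide")       # 6
print(f"4-GPU pod schedulable: {fits_somewhere}")   # False
```

Six GPUs sit idle, burning roughly $12/hour at A100 rates, while the 4-GPU pod waits. Bin-packing-aware scheduling and consolidation exist precisely to avoid this state.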
The default Kubernetes scheduler handles GPU scheduling poorly because it does not understand gang scheduling (all-or-nothing allocation), fair-share between teams, or job queuing. You need a specialized batch scheduler.
Kubernetes GPU Fundamentals
Before diving into advanced schedulers, understand how Kubernetes handles GPUs at the base level.
# NVIDIA device plugin DaemonSet - required for GPU scheduling
# This runs on every GPU node and exposes GPUs as a Kubernetes resource
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        accelerator: "true"

# After deployment, nodes report GPU capacity:
#   kubectl describe node gpu-node-1
#   Capacity:
#     nvidia.com/gpu: 8
#   Allocatable:
#     nvidia.com/gpu: 8
Kueue: Kubernetes-Native Job Queuing
Kueue is the official Kubernetes SIG-Scheduling project for batch job queuing. It provides resource quotas, fair-share scheduling, preemption, and priority queues — all natively integrated with Kubernetes.
# Kueue setup: ResourceFlavor, ClusterQueue, and LocalQueue
---
# ResourceFlavor defines GPU types available in the cluster
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100-80gb
spec:
  nodeLabels:
    accelerator: nvidia-a100-80gb
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-80gb
spec:
  nodeLabels:
    accelerator: nvidia-h100-80gb
---
# ClusterQueue defines total available resources and fair-share policy
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training-cluster
spec:
  cohort: ml-org                  # Queues in the same cohort can borrow resources
  queueingStrategy: BestEffortFIFO
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: a100-80gb
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 64    # 64 A100 GPUs total
              borrowingLimit: 16  # Can borrow up to 16 more from cohort
              lendingLimit: 16    # Can lend up to 16 to cohort
            - name: "cpu"
              nominalQuota: 512
            - name: "memory"
              nominalQuota: 4Ti
        - name: h100-80gb
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 32
            # Every flavor in a resource group must set quota for all
            # covered resources, so cpu/memory are required here too
            - name: "cpu"
              nominalQuota: 256
            - name: "memory"
              nominalQuota: 2Ti
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  fairSharing:
    weight: 1
---
# WorkloadPriorityClass referenced by the Job below
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 10000
description: "High-priority training workloads"
---
# LocalQueue for team-A - maps to the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-training
  namespace: team-a
spec:
  clusterQueue: ml-training-cluster
---
# Submit a training job to the queue
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-finetune-v3
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-training
    kueue.x-k8s.io/priority-class: high-priority
spec:
  parallelism: 2   # 2 pods, each with 8 GPUs = 16 GPUs total
  completions: 2
  suspend: true    # Kueue unsuspends the Job once quota is admitted
  template:
    spec:
      containers:
        - name: trainer
          image: ml-team/trainer:v3.2
          command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train.py"]
          resources:
            requests:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
            limits:
              nvidia.com/gpu: "8"
      restartPolicy: OnFailure
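The admission arithmetic behind the ClusterQueue above reduces to a simple check: a workload fits if the queue's current usage plus the request stays within `nominalQuota`, or within `nominalQuota + borrowingLimit` when the cohort has enough idle capacity to lend. A simplified sketch (hypothetical numbers, one resource and flavor; real Kueue admission also accounts for flavor ordering and preemption):

```python
def admissible(request: int, in_use: int, nominal: int,
               borrowing_limit: int, cohort_idle: int) -> bool:
    """Simplified Kueue-style admission check for one resource/flavor.

    The amount above the queue's nominal quota (the 'overage') must fit
    within both the queue's borrowingLimit and the cohort's idle capacity.
    """
    overage = max(0, in_use + request - nominal)
    return overage <= borrowing_limit and overage <= cohort_idle

# Queue with 64 nominal A100s that may borrow up to 16 from the cohort:
print(admissible(request=16, in_use=56, nominal=64,
                 borrowing_limit=16, cohort_idle=12))  # True  (borrows 8)
print(admissible(request=32, in_use=56, nominal=64,
                 borrowing_limit=16, cohort_idle=12))  # False (would need 24)
```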
Volcano: Gang Scheduling for Distributed Training
Volcano is a CNCF project that adds gang scheduling to Kubernetes. Gang scheduling ensures that all pods for a distributed training job are scheduled simultaneously — without it, you get deadlocks where half the pods are scheduled and waiting for the other half, which cannot be scheduled because the first half is consuming all the GPUs.
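The deadlock described above can be reproduced in a few lines. A sketch: a 16-GPU cluster, two jobs that each need two 8-GPU pods, and a scheduler that admits pods one at a time in arrival order with no gang semantics:

```python
def schedule_pod_by_pod(total_gpus: int, arrivals: list) -> dict:
    """Greedy per-pod scheduling: admit any pod that fits, in order.

    Each entry in arrivals is (job_name, gpus_for_one_pod). Returns GPUs
    held per job. Because no job releases GPUs until ALL of its pods run,
    partially placed jobs pin resources indefinitely.
    """
    free = total_gpus
    held: dict = {}
    for job, gpus in arrivals:
        if gpus <= free:
            free -= gpus
            held[job] = held.get(job, 0) + gpus
    return held

# Two jobs, each needing 2 pods x 8 GPUs, interleaved arrivals, 16 GPUs total.
held = schedule_pod_by_pod(16, [("job-a", 8), ("job-b", 8),
                                ("job-a", 8), ("job-b", 8)])
print(held)  # {'job-a': 8, 'job-b': 8} -- each job half-placed, both stuck forever
```

With gang scheduling (`minAvailable: 2` below), job-a's two pods are admitted atomically, the job runs and releases its GPUs, and job-b follows.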
# Volcano job for gang-scheduled distributed training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training-v5
  namespace: ml-training
  annotations:
    sla-waiting-time: 30m     # SLA plugin: reject the job if not scheduled within 30 min
spec:
  schedulerName: volcano
  minAvailable: 2             # Gang scheduling: ALL 2 tasks must be schedulable together
  queue: team-a-queue
  policies:
    - event: PodEvicted
      action: RestartJob
    - event: PodFailed
      action: RestartTask
      timeout: 5m             # Restart the individual pod if it fails within 5 min
  plugins:
    ssh: []                   # Inject SSH config for inter-pod communication
    env: []                   # Inject VC_* task environment variables
    svc: []                   # Create headless services so workers can resolve each other
  tasks:
    - replicas: 2             # 2 nodes
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: trainer
              image: ml-team/trainer:v3.2-cuda12.1
              command:
                - torchrun
                - --nnodes=2
                - --nproc_per_node=8
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(VC_WORKER_0_HOST):29400
                - train_fsdp.py
              ports:
                - containerPort: 29400
                  name: rdzv
              resources:
                requests:
                  cpu: "32"
                  memory: "256Gi"
                  nvidia.com/gpu: "8"
                limits:
                  nvidia.com/gpu: "8"
          restartPolicy: OnFailure
          nodeSelector:
            accelerator: nvidia-a100-80gb
Fair-Share Scheduling and Preemption
In a multi-tenant cluster, you need policies that ensure fair access while maximizing GPU utilization. Here is a practical policy framework:
| Policy | When to Apply | Example |
|---|---|---|
| Guaranteed Quota | Each team gets a minimum allocation they can always use | Team A: 16 GPUs guaranteed, Team B: 8 GPUs guaranteed |
| Borrowing | Teams can use idle GPUs beyond their quota | Team A uses 24 GPUs (16 guaranteed + 8 borrowed from idle Team B quota) |
| Preemption | Reclaim borrowed GPUs when the owner needs them | Team B submits a job → Team A's borrowed pods are preempted (checkpointed and killed) |
| Priority Classes | Critical production retraining gets priority over experiments | Priority: production-retrain > experiment > development |
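The borrowing and preemption rows combine into one invariant: a team may exceed its guarantee only by using idle capacity, and it loses the borrowed portion first when the owner returns. A minimal sketch with the hypothetical quotas from the table:

```python
def reclaim(borrower_usage: int, borrower_quota: int, owner_demand: int) -> int:
    """GPUs to preempt from a borrowing team when the quota owner returns.

    Only the borrowed portion (usage above the borrower's own guaranteed
    quota) is ever reclaimed, and only as much as the owner actually needs.
    """
    borrowed = max(0, borrower_usage - borrower_quota)
    return min(borrowed, owner_demand)

# Team A guaranteed 16 GPUs, currently using 24 (8 borrowed from idle Team B quota).
print(reclaim(borrower_usage=24, borrower_quota=16, owner_demand=6))   # 6
print(reclaim(borrower_usage=24, borrower_quota=16, owner_demand=12))  # 8 (all borrowed)
print(reclaim(borrower_usage=12, borrower_quota=16, owner_demand=6))   # 0 (nothing borrowed)
```

The preempted pods should checkpoint on SIGTERM so the borrowed capacity is returned without losing training progress.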
# Priority classes for ML workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-retrain
value: 1000000
globalDefault: false
description: "Production model retraining - highest priority, preempts experiments"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment-high
value: 100000
description: "High-priority experiments - preempts dev workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment-normal
value: 10000
description: "Normal experiments - no preemption"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: development
value: 1000
preemptionPolicy: Never  # Dev pods never preempt others; with the lowest value, they are evicted first
description: "Development and notebooks - lowest priority, uses spare capacity"
Cost Allocation
Tracking GPU costs per team, project, and job is essential for budget management and optimization. Here is a production cost allocation system.
# GPU cost tracking with Prometheus and custom metrics
# Export per-job GPU cost as a Prometheus metric
from prometheus_client import Gauge, start_http_server

gpu_cost_gauge = Gauge(
    "training_job_cost_usd",
    "Cumulative GPU cost in USD for a training job",
    ["job_name", "team", "gpu_type", "namespace"],
)
start_http_server(8000)  # Serve /metrics for Prometheus to scrape

# GPU pricing (on-demand, per GPU-hour)
GPU_PRICING = {
    "nvidia-a100-80gb": 2.21,  # AWS p4d.24xlarge / 8
    "nvidia-a100-40gb": 1.65,  # GCP a2-highgpu-1g
    "nvidia-h100-80gb": 3.75,  # AWS p5.48xlarge / 8
    "nvidia-l4": 0.81,         # GCP g2-standard-4
}

def calculate_job_cost(job_name: str, namespace: str, epochs_completed: int = 1) -> dict:
    """Calculate GPU cost for a running/completed Kubernetes job."""
    # get_job_pods queries the Kubernetes API for the job's pods, including
    # GPU requests, node labels, and runtime in seconds (helper not shown)
    pods = get_job_pods(job_name, namespace)
    total_cost = 0.0
    total_gpu_hours = 0.0
    for pod in pods:
        gpu_count = int(pod["resources"]["nvidia.com/gpu"])
        gpu_type = pod["node_labels"]["accelerator"]
        runtime_hours = pod["runtime_seconds"] / 3600
        gpu_hours = gpu_count * runtime_hours
        total_gpu_hours += gpu_hours
        total_cost += gpu_hours * GPU_PRICING.get(gpu_type, 2.00)  # fallback rate for unknown GPU types
    return {
        "job_name": job_name,
        "total_gpu_hours": round(total_gpu_hours, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_epoch": round(total_cost / max(epochs_completed, 1), 2),
    }

# Monthly cost report per team
def generate_monthly_report(namespace: str, month: str) -> dict:
    # get_completed_jobs returns per-job cost summaries for the month (helper not shown)
    jobs = get_completed_jobs(namespace, month)
    return {
        "team": namespace,
        "month": month,
        "total_gpu_hours": sum(j["gpu_hours"] for j in jobs),
        "total_cost_usd": sum(j["cost_usd"] for j in jobs),
        "jobs_count": len(jobs),
        "top_5_costly_jobs": sorted(jobs, key=lambda j: -j["cost_usd"])[:5],
        "gpu_utilization_avg": sum(j["gpu_util"] for j in jobs) / max(len(jobs), 1),
    }
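As a sanity check on the pricing table, the cost of the 2-node Kueue job from earlier (2 pods of 8 A100s each) is easy to compute by hand. A sketch with a hypothetical 10-hour runtime:

```python
# Hypothetical run: the 16-GPU A100 job from earlier, running for 10 hours.
A100_RATE = 2.21  # USD per GPU-hour (on-demand, from the pricing table)

pods = 2
gpus_per_pod = 8
runtime_hours = 10.0

gpu_hours = pods * gpus_per_pod * runtime_hours
cost = gpu_hours * A100_RATE

print(f"{gpu_hours} GPU-hours -> ${cost:.2f}")  # 160.0 GPU-hours -> $353.60
```

At these rates even a modest fine-tuning run is a three-figure line item, which is why per-team reports and utilization tracking pay for themselves quickly.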
What Is Next
With GPU scheduling and cost allocation in place, the next lesson covers CI/CD for ML. You will learn how to automate retraining, validate models in CI, and deploy with confidence using GitHub Actions and Argo Workflows.