Advanced GPU Scheduling

Efficient GPU scheduling is critical for maximizing utilization and minimizing costs. This lesson covers GPU scheduling strategies including NVIDIA MIG for hardware partitioning, time-slicing for sharing GPUs between lightweight workloads, topology-aware scheduling for multi-GPU training performance, and fair-share queuing for multi-tenant environments.

MIG (Multi-Instance GPU) Scheduling

MIG partitions a single supported GPU (e.g., A100 or H100) into up to seven isolated instances, each with dedicated compute, memory, and memory bandwidth:

MIG Profile | Compute  | Memory (A100 80GB) | Use Case
1g.10gb     | 1/7 GPU  | 10 GB              | Small inference, development
2g.20gb     | 2/7 GPU  | 20 GB              | Medium inference, fine-tuning
3g.40gb     | 3/7 GPU  | 40 GB              | Large inference, medium training
7g.80gb     | Full GPU | 80 GB              | Large-scale training
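As a hedged sketch, a pod can request a specific MIG slice through the extended resource name that the NVIDIA device plugin advertises when running in its `mixed` MIG strategy (where each profile becomes its own resource). The pod name and container image below are illustrative:

```yaml
# Illustrative pod requesting one 1g.10gb MIG slice.
# Assumes the NVIDIA device plugin runs with migStrategy=mixed,
# which exposes each MIG profile as its own extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference        # hypothetical name
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest  # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```

Because each slice is hardware-isolated, this pod cannot interfere with workloads running on the GPU's other MIG instances.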

GPU Time-Slicing

For workloads that do not need a full GPU, NVIDIA's time-slicing feature allows multiple pods to share a single GPU through temporal multiplexing. Unlike MIG, time-slicing does not provide memory isolation:

  • Advantages — Works on all NVIDIA GPU generations, easy to configure, good for bursty workloads
  • Disadvantages — No memory isolation (one workload can OOM-kill another), context switching overhead, unpredictable latency
  • Best for — Development environments, notebook servers, lightweight inference
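A minimal sketch of how time-slicing is typically enabled: the NVIDIA device plugin reads a sharing configuration that re-advertises each physical GPU as several schedulable replicas. The ConfigMap name and namespace below are illustrative:

```yaml
# Illustrative time-slicing config for the NVIDIA device plugin:
# each physical GPU is advertised as 4 replicas, so up to four pods
# can share one GPU. No memory isolation is enforced between them.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config    # hypothetical name
  namespace: gpu-operator      # illustrative namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Pods then request `nvidia.com/gpu: 1` as usual; the scheduler simply sees four times as many allocatable GPUs per node.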

Topology-Aware Scheduling

For distributed training, the placement of GPUs relative to each other dramatically affects performance. GPUs connected via NVLink are 5-10x faster for communication than GPUs connected only via PCIe:

  • Same NVSwitch domain — Highest bandwidth (900 GB/s on H100); prefer for data-parallel training
  • Same node, different NVSwitch — Still fast, minor penalty; acceptable for most training
  • Different nodes — Network-bound (InfiniBand 400 Gbps); use only for pipeline parallelism or very large models
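Kubernetes has no first-class NVLink awareness, so the simplest way to keep a data-parallel job inside one NVSwitch domain is to request all of a node's GPUs in a single pod. A hedged sketch, with an illustrative node selector and image:

```yaml
# Illustrative training pod that claims all 8 GPUs of an HGX-style
# node, keeping every GPU pair on the same NVSwitch fabric.
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker             # hypothetical name
spec:
  nodeSelector:
    gpu.example.com/fabric: nvswitch   # illustrative node label
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 8
```

Jobs that span nodes fall into the network-bound tier above, which is why multi-node placement is usually reserved for pipeline parallelism or models too large for one node.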

Fair-Share GPU Queuing

In multi-tenant environments, implement fair-share scheduling to prevent any single team from monopolizing GPU resources:

  • Resource quotas — Kubernetes ResourceQuotas to cap GPU allocation per namespace
  • Priority classes — High-priority production inference preempts lower-priority training jobs
  • Queue systems — Volcano or Kueue for batch job scheduling with fair-share policies
  • Gang scheduling — Ensure all GPUs for a distributed training job are allocated simultaneously
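The first two bullets can be sketched with standard Kubernetes objects (namespace, quota size, and priority value below are illustrative):

```yaml
# Illustrative per-team quota: the "research" namespace may hold
# at most 16 GPUs across all of its pods at any one time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota              # hypothetical name
  namespace: research          # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "16"
---
# Illustrative priority class so production inference can preempt
# lower-priority training pods when the cluster runs out of GPUs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference         # hypothetical name
value: 1000000
globalDefault: false
description: "Production inference preempts training jobs"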

Cost Saving: Use spot/preemptible instances for training workloads that checkpoint regularly. Combine with gang scheduling and priority queuing to maximize GPU utilization while maintaining training reliability through automatic restart from checkpoints.
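The checkpoint-restart pattern behind that cost-saving tip can be sketched in a few lines of Python. This is a hypothetical skeleton, not a real training loop: the checkpoint file name and step counter stand in for a framework's actual state-saving calls.

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path


def load_step():
    # Resume from the last checkpoint if a preempted run left one behind.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0


def save_step(step):
    # Write to a temp file, then atomically replace, so a preemption
    # mid-save cannot leave a corrupt checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)


def train(total_steps, ckpt_every=2):
    # Resume where the previous (possibly preempted) run stopped.
    step = load_step()
    while step < total_steps:
        step += 1  # one "training step" stands in for real GPU work
        if step % ckpt_every == 0:
            save_step(step)
    return step
```

On a spot instance, the scheduler simply restarts the pod after preemption and `load_step` picks up from the last saved step, so only the work since the previous checkpoint is lost.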

Ready for Best Practices?

The final lesson covers production GPU management patterns including fleet maintenance and cost optimization.

Next: Best Practices →