Advanced GPU Scheduling
Efficient GPU scheduling is critical for maximizing utilization and minimizing costs. This lesson covers GPU scheduling strategies including NVIDIA MIG for hardware partitioning, time-slicing for sharing GPUs between lightweight workloads, topology-aware scheduling for multi-GPU training performance, and fair-share queuing for multi-tenant environments.
MIG (Multi-Instance GPU) Scheduling
MIG allows partitioning a single supported GPU (NVIDIA Ampere and later, e.g., A100 or H100) into up to seven isolated instances, each with dedicated compute, memory, and bandwidth:
| MIG Profile | Compute | Memory (A100 80GB) | Use Case |
|---|---|---|---|
| 1g.10gb | 1/7 GPU | 10 GB | Small inference, development |
| 2g.20gb | 2/7 GPU | 20 GB | Medium inference, fine-tuning |
| 3g.40gb | 3/7 GPU | 40 GB | Large inference, medium training |
| 7g.80gb | Full GPU | 80 GB | Large-scale training |
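Once MIG profiles are created on a node, pods can request a specific slice as an extended resource. The sketch below assumes the NVIDIA device plugin is deployed with its "mixed" MIG strategy, which advertises each profile as its own resource name; the pod name and container image are illustrative placeholders.

```yaml
# Hypothetical pod requesting a single 1g.10gb MIG slice.
# Assumes the NVIDIA device plugin's "mixed" MIG strategy, which exposes
# each profile (e.g., nvidia.com/mig-1g.10gb) as a distinct resource.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference        # example name
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```

Because each profile is a separate schedulable resource, the scheduler can bin-pack small inference pods onto one physical GPU without any risk of them contending for memory.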
GPU Time-Slicing
For workloads that do not need a full GPU, NVIDIA's time-slicing feature allows multiple pods to share a single GPU through temporal multiplexing. Unlike MIG, time-slicing does not provide memory isolation:
- Advantages — Works on all NVIDIA GPU generations, easy to configure, good for bursty workloads
- Disadvantages — No memory isolation (one workload can OOM-kill another), context switching overhead, unpredictable latency
- Best for — Development environments, notebook servers, lightweight inference
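Time-slicing is enabled through the NVIDIA device plugin's configuration. A minimal sketch, assuming the plugin is deployed in `kube-system` and pointed at this ConfigMap (the ConfigMap name and replica count are illustrative):

```yaml
# Hypothetical device plugin config advertising each physical GPU
# as 4 schedulable nvidia.com/gpu resources via time-slicing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # example name
  namespace: kube-system              # assumed deployment namespace
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # 4 pods can share one physical GPU
```

Note that the advertised replicas are oversubscription slots, not partitions: all four pods see the full GPU memory, which is why one workload can still OOM-kill another.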
Topology-Aware Scheduling
For distributed training, the placement of GPUs relative to each other dramatically affects performance. GPUs connected via NVLink are 5-10x faster for communication than GPUs connected only via PCIe:
- Same NVSwitch domain — Highest bandwidth (900 GB/s on H100); prefer for data-parallel training
- Same node, different NVSwitch — Still fast, minor penalty; acceptable for most training
- Different nodes — Network-bound (InfiniBand 400 Gbps); use only for pipeline parallelism or very large models
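Within a node, topology alignment is handled by the kubelet's Topology Manager, which co-locates a pod's GPUs, CPUs, and NICs on the same NUMA node so traffic stays on local PCIe/NVLink paths. A minimal kubelet configuration sketch (policy choice is an assumption; stricter policies can cause pods to be rejected on fragmented nodes):

```yaml
# Hypothetical kubelet configuration fragment enabling NUMA-aligned
# device and CPU allocation for GPU pods.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node   # reject placements that span NUMA nodes
cpuManagerPolicy: static                  # required for CPU pinning alignment
```

Cross-node placement (racks, InfiniBand fabric locality) is outside the kubelet's scope and typically falls to a batch scheduler or node labels plus pod affinity rules.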
Fair-Share GPU Queuing
In multi-tenant environments, implement fair-share scheduling to prevent any single team from monopolizing GPU resources:
- Resource quotas — Kubernetes ResourceQuotas to cap GPU allocation per namespace
- Priority classes — High-priority production inference preempts lower-priority training jobs
- Queue systems — Volcano or Kueue for batch job scheduling with fair-share policies
- Gang scheduling — Ensure all GPUs for a distributed training job are allocated simultaneously
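The first two mechanisms above map directly onto standard Kubernetes objects. A minimal sketch, with the namespace, names, GPU cap, and priority value as illustrative assumptions:

```yaml
# Hypothetical quota capping one team's namespace at 8 GPUs total.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a           # assumed team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
---
# Hypothetical priority class letting production inference preempt
# lower-priority training jobs when the cluster is full.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference
value: 100000                  # example value; higher preempts lower
preemptionPolicy: PreemptLowerPriority
```

Quotas and priorities handle per-team caps and preemption, but they do not queue or order batch jobs; that is where Volcano or Kueue, with gang scheduling for distributed training, come in.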
Ready for Best Practices?
The final lesson covers production GPU management patterns including fleet maintenance and cost optimization.
Next: Best Practices →
Lilly Tech Systems