Advanced GPU Scheduling

Efficient GPU scheduling is critical for maximizing utilization and minimizing costs. This lesson covers GPU scheduling strategies including NVIDIA MIG for hardware partitioning, time-slicing for sharing GPUs between lightweight workloads, topology-aware scheduling for multi-GPU training performance, and fair-share queuing for multi-tenant environments.

MIG (Multi-Instance GPU) Scheduling

MIG partitions a single supported GPU (e.g., A100 or H100) into up to seven isolated instances, each with dedicated compute, memory, and memory bandwidth:

MIG Profile | Compute  | Memory (A100 80GB) | Use Case
1g.10gb     | 1/7 GPU  | 10 GB              | Small inference, development
2g.20gb     | 2/7 GPU  | 20 GB              | Medium inference, fine-tuning
3g.40gb     | 3/7 GPU  | 40 GB              | Large inference, medium training
7g.80gb     | Full GPU | 80 GB              | Large-scale training
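As a hedged sketch, a pod can request a specific MIG slice through the extended resource name that the NVIDIA device plugin advertises when running in its `mixed` MIG strategy (where each profile becomes its own resource). The pod name and container image below are illustrative:

```yaml
# Illustrative pod requesting one 1g.10gb MIG slice.
# Assumes the NVIDIA device plugin runs with migStrategy=mixed,
# which exposes each MIG profile as its own extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference        # hypothetical name
spec:
  containers:
    - name: server
      image: registry.example.com/inference-server:latest  # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```

Because each slice is hardware-isolated, this pod cannot interfere with workloads running on the GPU's other MIG instances.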

GPU Time-Slicing

For workloads that do not need a full GPU, NVIDIA's time-slicing feature allows multiple pods to share a single GPU through temporal multiplexing. Unlike MIG, time-slicing does not provide memory isolation:

  • Advantages — Works on all NVIDIA GPU generations, easy to configure, good for bursty workloads
  • Disadvantages — No memory isolation (one workload can OOM-kill another), context switching overhead, unpredictable latency
  • Best for — Development environments, notebook servers, lightweight inference
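A minimal sketch of how time-slicing is typically enabled: the NVIDIA device plugin reads a sharing configuration that re-advertises each physical GPU as several schedulable replicas. The ConfigMap name and namespace below are illustrative:

```yaml
# Illustrative time-slicing config for the NVIDIA device plugin:
# each physical GPU is advertised as 4 replicas, so up to four pods
# can share one GPU. No memory isolation is enforced between them.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config    # hypothetical name
  namespace: gpu-operator      # illustrative namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Pods then request `nvidia.com/gpu: 1` as usual; the scheduler simply sees four times as many allocatable GPUs per node.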

Topology-Aware Scheduling

For distributed training, the placement of GPUs relative to each other dramatically affects performance. GPUs connected via NVLink are 5-10x faster for communication than GPUs connected only via PCIe:

  • Same NVSwitch domain — Highest bandwidth (900 GB/s on H100); prefer for data-parallel training
  • Same node, different NVSwitch — Still fast, minor penalty; acceptable for most training
  • Different nodes — Network-bound (InfiniBand 400 Gbps); use only for pipeline parallelism or very large models
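Kubernetes has no first-class NVLink awareness, so the simplest way to keep a data-parallel job inside one NVSwitch domain is to request all of a node's GPUs in a single pod. A hedged sketch, with an illustrative node selector and image:

```yaml
# Illustrative training pod that claims all 8 GPUs of an HGX-style
# node, keeping every GPU pair on the same NVSwitch fabric.
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker             # hypothetical name
spec:
  nodeSelector:
    gpu.example.com/fabric: nvswitch   # illustrative node label
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 8
```

Jobs that span nodes fall into the network-bound tier above, which is why multi-node placement is usually reserved for pipeline parallelism or models too large for one node.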

Fair-Share GPU Queuing

In multi-tenant environments, implement fair-share scheduling to prevent any single team from monopolizing GPU resources:

  • Resource quotas — Kubernetes ResourceQuotas to cap GPU allocation per namespace
  • Priority classes — High-priority production inference preempts lower-priority training jobs
  • Queue systems — Volcano or Kueue for batch job scheduling with fair-share policies
  • Gang scheduling — Ensure all GPUs for a distributed training job are allocated simultaneously
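The first two bullets can be sketched with standard Kubernetes objects (namespace, quota size, and priority value below are illustrative):

```yaml
# Illustrative per-team quota: the "research" namespace may hold
# at most 16 GPUs across all of its pods at any one time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota              # hypothetical name
  namespace: research          # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "16"
---
# Illustrative priority class so production inference can preempt
# lower-priority training pods when the cluster runs out of GPUs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-inference         # hypothetical name
value: 1000000
globalDefault: false
description: "Production inference preempts training jobs"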

Cost Saving: Use spot/preemptible instances for training workloads that checkpoint regularly. Combine with gang scheduling and priority queuing to maximize GPU utilization while maintaining training reliability through automatic restart from checkpoints.
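The checkpoint-restart pattern behind that cost-saving tip can be sketched in a few lines of Python. This is a hypothetical skeleton, not a real training loop: the checkpoint file name and step counter stand in for a framework's actual state-saving calls.

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path


def load_step():
    # Resume from the last checkpoint if a preempted run left one behind.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0


def save_step(step):
    # Write to a temp file, then atomically replace, so a preemption
    # mid-save cannot leave a corrupt checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)


def train(total_steps, ckpt_every=2):
    # Resume where the previous (possibly preempted) run stopped.
    step = load_step()
    while step < total_steps:
        step += 1  # one "training step" stands in for real GPU work
        if step % ckpt_every == 0:
            save_step(step)
    return step
```

On a spot instance, the scheduler simply restarts the pod after preemption and `load_step` picks up from the last saved step, so only the work since the previous checkpoint is lost.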

Ready for Best Practices?

The final lesson covers production GPU management patterns including fleet maintenance and cost optimization.

Next: Best Practices →