Intermediate

Kubernetes for ML Questions

Kubernetes is the de facto orchestration platform for ML workloads at most companies. These 10 questions cover GPU scheduling, operators, job queuing, autoscaling, and resource management — the practical knowledge interviewers expect from AI infrastructure candidates.

Q1: How does Kubernetes handle GPU scheduling? What are the limitations?


Answer: Kubernetes schedules GPUs as extended resources using the NVIDIA device plugin. GPUs are advertised as nvidia.com/gpu: N on each node, and pods request them in their resource spec.

How it works:

  1. The NVIDIA GPU device plugin runs as a DaemonSet on each GPU node
  2. It discovers GPUs via NVML (the same library nvidia-smi uses) and reports them to the kubelet
  3. The scheduler treats GPUs as integer resources: a pod requesting 2 GPUs gets exactly 2
  4. GPUs are allocated exclusively — no sharing between pods by default
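The request in step 3 looks like this in a pod spec (a minimal sketch; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training                           # hypothetical name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled image
    resources:
      limits:
        nvidia.com/gpu: 2                     # whole GPUs only; both allocated exclusively
```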

Limitations:

  • No fractional GPUs: You cannot request 0.5 GPUs. A small inference model that uses 2 GB of an 80 GB A100 still monopolizes the entire GPU. Solutions: MIG, GPU time-slicing, or virtual GPU (vGPU).
  • No topology awareness: The default scheduler does not know which GPUs are connected via NVLink vs PCIe. It may schedule a 4-GPU training job across GPUs on different NUMA nodes or PCIe switches, killing inter-GPU bandwidth. Solution: Topology-aware scheduling with the GPU Operator or custom scheduler plugins.
  • No gang scheduling: Kubernetes schedules pods independently. A distributed training job needing 8 GPUs across 2 nodes might get 4 GPUs on node 1 but wait indefinitely for 4 GPUs on node 2 (resource fragmentation). Solutions: Volcano, Kueue, or custom scheduling.
  • No GPU memory awareness: The scheduler cannot distinguish between GPU models or available GPU memory. A pod requesting 1 GPU might get an A100 when it only needs a T4. Solution: node labels and node selectors/affinity.
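As a sketch of that last workaround, assuming GPU Feature Discovery labels the nodes, a nodeSelector can pin a pod to a specific GPU model:

```yaml
# Pin a small inference pod to T4 nodes via a GPU Feature Discovery label
# (the exact label value depends on how GFD reports the product name)
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4
```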

Q2: What is the NVIDIA GPU Operator and why would you use it?


Answer: The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes. Without it, you must manually install and maintain NVIDIA drivers, container runtime, device plugin, DCGM, and MIG manager on every GPU node.

Components managed by the GPU Operator:

  • NVIDIA Driver: Installs and manages GPU drivers as a container, enabling driver updates without node reboots
  • NVIDIA Container Toolkit: Enables containers to access GPUs via the nvidia container runtime
  • NVIDIA Device Plugin: Exposes GPUs to Kubernetes scheduler
  • DCGM Exporter: Exports GPU metrics to Prometheus for monitoring
  • MIG Manager: Configures Multi-Instance GPU partitioning
  • GPU Feature Discovery: Labels nodes with GPU attributes (model, driver version, CUDA version, MIG capability)
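For illustration, GPU Feature Discovery applies node labels along these lines (values are illustrative and depend on the hardware and driver):

```yaml
# Example node labels set by GPU Feature Discovery
metadata:
  labels:
    nvidia.com/gpu.product: A100-SXM4-80GB
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.memory: "81920"
    nvidia.com/mig.capable: "true"
```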

Why it matters: In a large cluster with hundreds of GPU nodes, manually managing driver versions and toolkit compatibility across nodes is operationally painful. The GPU Operator treats the entire NVIDIA stack as a Kubernetes-native workload, enabling GitOps-style management and consistent GPU environments across the cluster.

Q3: Compare Volcano and Kueue for ML job scheduling on Kubernetes.

💡

Answer: Both solve the gang scheduling and job queuing problem for ML workloads, but with different philosophies:

| Feature | Volcano | Kueue |
|---|---|---|
| Approach | Full scheduler replacement (custom scheduler) | Admission controller (works with the default scheduler) |
| Gang scheduling | Native: all pods in a job are scheduled together or not at all | Supported via admission gating: pods are held until all resources are available |
| Fair sharing | Proportional fair scheduling with DRF (Dominant Resource Fairness) | Resource quotas with ClusterQueues and cohorts for borrowing |
| Preemption | Priority-based preemption within the scheduler | Priority-based preemption with borrowing and reclaim policies |
| Job types | Custom VolcanoJob CRD; also supports K8s Jobs, Spark, MPI | Works with native K8s Jobs, Kubeflow training operators, Ray |
| Maturity | CNCF Incubating; used at Huawei, Baidu, and several large enterprises | Kubernetes SIG Scheduling project; used at Google, adopted by GKE |
| Complexity | Higher: replaces the scheduler, manages its own queue | Lower: admission controller pattern, simpler architecture |

Recommendation: Use Kueue for most Kubernetes-native ML platforms — it is simpler, integrates with the default scheduler, and is actively developed by the Kubernetes community. Use Volcano when you need advanced scheduling policies (bin-packing, task-level scheduling within a job) or have an existing Volcano deployment.
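With Kueue, submitting to a queue is a one-label change on an ordinary Job (queue and image names are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # a LocalQueue in this namespace
spec:
  suspend: true              # Kueue admits (unsuspends) the Job when quota is free
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: my-trainer:latest               # illustrative
        resources:
          limits:
            nvidia.com/gpu: 2
```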

Q4: How do you implement autoscaling for ML workloads on Kubernetes?


Answer: ML workloads require different autoscaling strategies than web services:

Training workloads (batch jobs):

  • Typically not autoscaled — training jobs request a fixed number of GPUs for their duration
  • Cluster Autoscaler: Scales the node pool up when pending pods need GPUs, scales down when nodes are idle. Configure with appropriate scale-down delay (GPU nodes are expensive, but provisioning takes 5–10 minutes)
  • Karpenter (AWS) / NAP (GKE): More responsive node provisioning. Can select the optimal GPU instance type based on pending pod requirements

Inference workloads (serving):

  • HPA (Horizontal Pod Autoscaler): Scale inference pods based on GPU utilization, request rate, or queue depth. Challenge: GPU utilization metrics require DCGM and custom metrics adapter.
  • KEDA (Kubernetes Event-Driven Autoscaling): Scale based on external metrics like message queue depth (Kafka/SQS), HTTP request rate, or custom Prometheus queries
  • VPA (Vertical Pod Autoscaler): Not recommended for GPU workloads: VPA adjusts CPU and memory requests and does not manage extended resources such as nvidia.com/gpu, so it cannot resize GPU allocations
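A sketch of the HPA approach, assuming DCGM metrics are already exposed through a custom metrics adapter (the metric and Deployment names depend on your setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference              # hypothetical inference Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL   # DCGM GPU utilization, via metrics adapter
      target:
        type: AverageValue
        averageValue: "70"           # target ~70% average GPU utilization
```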

Key considerations:

  • GPU node startup time: 5–10 minutes for cloud instances. Keep warm pools of pre-provisioned nodes for latency-sensitive scaling.
  • Model loading time: Loading a large model into GPU memory takes 30–120 seconds. Factor this into scaling response time.
  • Scale-to-zero: Possible for batch inference but challenging for real-time serving due to cold start latency. Use Knative with GPU support for serverless inference with scale-to-zero.

Q5: How do you manage resource quotas and fair sharing for GPU workloads across teams?


Answer: In a shared GPU cluster, resource management is critical to prevent one team from monopolizing expensive GPUs:

Kubernetes ResourceQuotas:

  • Set per-namespace GPU limits: requests.nvidia.com/gpu: "16" caps the team at 16 GPUs (extended resources are quota-ed via the requests. prefix)
  • Limitation: quotas are hard limits. If Team A has 16 GPU quota but is only using 4, the other 12 sit idle — they cannot be borrowed by Team B
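The hard-limit quota described above, as a minimal sketch (namespace name hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                 # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "16"   # hard cap; unused GPUs cannot be borrowed
```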

Kueue ClusterQueues (recommended):

  • Guaranteed quota: Each team gets a minimum GPU allocation (e.g., Team A: 16, Team B: 8)
  • Borrowing: When a team is under quota, other teams can borrow their unused GPUs
  • Preemption: When the owning team needs their GPUs back, borrowed workloads are preempted (lower-priority jobs killed first)
  • Cohorts: Group ClusterQueues into cohorts that can share resources. Research and production cohorts with different policies.
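A minimal ClusterQueue sketch with guaranteed quota and borrowing (names and numbers are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: ml-teams             # queues in the same cohort can lend and borrow
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor     # must reference an existing ResourceFlavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16       # guaranteed allocation
        borrowingLimit: 8      # may borrow up to 8 more from the cohort
```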

Priority classes:

  • Define priority tiers: production-critical (highest), time-sensitive-training, best-effort-experiment (lowest)
  • Higher priority workloads preempt lower priority ones when resources are scarce
  • Best-effort jobs run on idle capacity and get preempted immediately when higher-priority jobs arrive

Monitoring and chargeback: Use DCGM metrics + labels to track per-team GPU utilization. Build dashboards showing allocation vs actual usage. Teams with consistently low utilization get their quotas reduced.

Q6: How do you run distributed training jobs on Kubernetes?


Answer: Distributed training on Kubernetes requires coordination between multiple pods. Three main approaches:

1. Kubeflow Training Operator (recommended):

  • Custom CRDs for different frameworks: PyTorchJob, TFJob, MPIJob
  • PyTorchJob: creates N worker pods, sets environment variables for torch.distributed (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) automatically
  • Handles pod lifecycle: if one worker fails, the entire job is marked as failed and can be retried
  • Example: PyTorchJob with 4 workers, each requesting 8 GPUs = 32 GPU distributed training job
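The 32-GPU example above as a PyTorchJob sketch (image illustrative; the operator expects the container to be named pytorch):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch              # required container name
            image: my-trainer:latest   # illustrative
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 3                      # 1 master + 3 workers = 4 pods x 8 GPUs = 32
      template:
        spec:
          containers:
          - name: pytorch
            image: my-trainer:latest
            resources:
              limits:
                nvidia.com/gpu: 8
```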

2. MPI Operator:

  • Uses MPI (mpirun/mpiexec) to launch distributed jobs. Better for Horovod-based training.
  • Creates a launcher pod and N worker pods with SSH access between them
  • More complex setup but familiar to HPC users

3. Manual setup with StatefulSets or Jobs:

  • Create a headless service for DNS-based discovery
  • Use StatefulSet for stable pod names (worker-0, worker-1, etc.)
  • Set MASTER_ADDR to worker-0's DNS name, configure WORLD_SIZE and RANK via pod ordinal index
  • More control but more operational burden

Critical requirements for distributed training on K8s:

  • Host networking: Use hostNetwork: true for NCCL performance — avoids container networking overhead for GPU-to-GPU communication
  • Shared memory: Mount /dev/shm as a memory-backed volume with sufficient size (default 64MB is too small for PyTorch data loaders)
  • IPC_LOCK capability: Required for RDMA/InfiniBand support
  • NCCL environment: Set NCCL_SOCKET_IFNAME to the correct network interface
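The four requirements above map onto a pod spec roughly like this (interface name and sizes are placeholders for your environment):

```yaml
spec:
  hostNetwork: true                  # bypass the CNI overlay for NCCL traffic
  containers:
  - name: trainer
    image: my-trainer:latest         # illustrative
    env:
    - name: NCCL_SOCKET_IFNAME
      value: eth0                    # set to the node's high-speed interface
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]            # needed for RDMA memory pinning
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm            # enlarged shared memory for data loaders
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi                # size to fit the data-loader workers
```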

Q7: What is GPU time-slicing and how does it differ from MIG?


Answer: Both allow multiple workloads to share a single GPU, but with fundamentally different approaches:

GPU Time-Slicing:

  • The NVIDIA device plugin advertises more "virtual GPUs" than physical GPUs. Example: configure 4 replicas per GPU, so an 8-GPU node advertises 32 nvidia.com/gpu.
  • Workloads time-share the GPU through CUDA's built-in context switching. Each pod gets a turn to run kernels.
  • No memory isolation: All pods share the full GPU memory. One pod can OOM-kill others.
  • No compute isolation: A heavy workload on one slice affects performance of all others.
  • Best for: Development environments, Jupyter notebooks, small inference models where utilization is low and intermittent.
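The 4-replicas-per-GPU configuration above can be expressed in the device plugin's sharing config (a sketch; how the ConfigMap is wired to the plugin depends on your device plugin or GPU Operator deployment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4     # each physical GPU is advertised as 4 schedulable GPUs
```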

MIG (Multi-Instance GPU):

  • Hardware-level partitioning: each instance gets dedicated SMs, memory, and memory bandwidth.
  • Full isolation: one instance cannot affect another's performance or memory.
  • Only available on MIG-capable data-center GPUs (A100, A30, H100 and newer). Limited partition profiles (up to 7 instances per GPU).
  • Best for: Multi-tenant inference, production workloads requiring guaranteed performance.

Interview insight: Time-slicing is a quick win for improving GPU utilization in development clusters (from 10% to 50%+), but never use it for production inference where latency SLAs matter. MIG is the production solution for GPU sharing.

Q8: How do you handle node failures and pod evictions for long-running training jobs?


Answer: Long-running training jobs (hours to weeks) are vulnerable to many failure modes on Kubernetes:

Failure modes:

  • Node failure: Hardware crash, kernel panic, GPU hardware error. Pod is lost.
  • Pod eviction: Node pressure (memory, disk), preemption by higher-priority pod, or cluster autoscaler scaling down.
  • GPU error: Uncorrectable ECC error, GPU fallen off bus (XID 79). Requires GPU replacement.
  • Network partition: InfiniBand link failure between nodes. NCCL timeout kills all workers.

Mitigation strategies:

  • Checkpointing: Save model state to persistent storage (PVC or S3/GCS) every 15–30 minutes. Use async checkpointing to minimize training interruption. Test checkpoint restore before relying on it.
  • Pod Disruption Budgets (PDB): Set minAvailable to prevent voluntary evictions from disrupting training jobs. Protects against cluster autoscaler and node drain operations.
  • Priority classes: Assign high priority to training jobs so they are not preempted by lower-priority workloads.
  • TorchElastic / Elastic Horovod: Automatically restart failed workers and resume from checkpoint. Supports dynamic scaling without restarting the entire job.
  • Job retry with backoff: Configure Kubernetes Job with backoffLimit and restart policy. On failure, the job restarts and resumes from the latest checkpoint.
  • Health monitoring: Use liveness probes that check GPU health (nvidia-smi, DCGM) and training progress (step counter). Kill and restart stuck pods proactively.
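The PDB mitigation above, as a sketch (the label selector is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
spec:
  minAvailable: "100%"     # block voluntary disruptions (drains, autoscaler scale-down)
  selector:
    matchLabels:
      app: ddp-train       # hypothetical label on the training pods
```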

Q9: What is topology-aware scheduling and why does it matter for ML?


Answer: Topology-aware scheduling places pods on GPUs that have optimal interconnect topology for the workload. This is critical because GPU communication performance varies dramatically based on physical connectivity.

Why it matters:

  • Two GPUs connected via NVLink: 900 GB/s bandwidth, ~1 microsecond latency
  • Two GPUs connected via PCIe switch: 32 GB/s bandwidth, ~5 microsecond latency
  • Two GPUs on different NUMA nodes: 16 GB/s effective bandwidth, ~10 microsecond latency
  • A 4-GPU tensor parallel training job placed on NVLink-connected GPUs runs 3–5x faster than the same job on PCIe-connected GPUs

Implementation:

  • GPU Feature Discovery + Node Labels: Label nodes with GPU topology information (NVLink connectivity, NUMA affinity). Use node affinity rules to place workloads on nodes with optimal topology.
  • Topology Aware Scheduler Plugin: NVIDIA's custom scheduler extension that understands GPU-to-GPU connectivity within a node and assigns GPUs that are NVLink-connected to the same pod.
  • NUMA-aware scheduling: Ensure GPU, CPU, and memory are on the same NUMA node. Crossing NUMA boundaries for data loading adds latency and reduces PCIe throughput.

In practice: For DGX-style nodes with NVSwitch (all GPUs connected), topology is less critical within a node. For heterogeneous servers with mixed NVLink/PCIe connectivity, topology-aware scheduling can make or break training performance.

Q10: How do you manage Kubernetes networking for distributed training?

💡

Answer: Distributed training has unique networking requirements that differ from typical Kubernetes services:

Host networking vs pod networking:

  • Standard Kubernetes: pods get virtual IPs via CNI (Calico, Flannel, Cilium). Fine for microservices but adds overhead for high-throughput GPU communication.
  • For distributed training: use hostNetwork: true so pods bypass the overlay network and directly access host network interfaces, including InfiniBand adapters.
  • Trade-off: host networking means pod port conflicts and loss of network isolation. Acceptable for training jobs, not for multi-tenant services.

InfiniBand / RDMA on Kubernetes:

  • Install the RDMA device plugin (or Mellanox OFED device plugin) to expose InfiniBand adapters as schedulable resources
  • Pods request InfiniBand: rdma/hca: 1 in resource spec
  • Requires IPC_LOCK capability and SYS_RESOURCE for memory locking (RDMA requires pinned memory)
  • Configure NCCL to use the InfiniBand interface: NCCL_IB_HCA=mlx5
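Combining those points in a pod sketch (the rdma/hca resource name depends on how the RDMA device plugin is configured):

```yaml
spec:
  containers:
  - name: trainer
    image: my-trainer:latest       # illustrative
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_RESOURCE"]   # allow pinning memory for RDMA
    env:
    - name: NCCL_IB_HCA
      value: mlx5                  # prefix-match the Mellanox HCA devices
    resources:
      limits:
        nvidia.com/gpu: 8
        rdma/hca: 1                # InfiniBand adapter exposed by the RDMA plugin
```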

Multi-NIC / Secondary networks:

  • Use Multus CNI to attach secondary network interfaces (InfiniBand, high-speed Ethernet) to pods alongside the primary Kubernetes network
  • Training traffic goes over the high-speed secondary network; Kubernetes management traffic (health checks, logs) uses the primary network
  • This separation prevents training traffic from being impacted by cluster management overhead

Network policies: Do not apply restrictive network policies to distributed training pods — they need to communicate freely on all ports for NCCL. Isolate training namespaces at the network level instead.