Intermediate

Kubernetes for ML Questions

Kubernetes is the de facto orchestration platform for ML workloads at most companies. These 10 questions cover GPU scheduling, operators, job queuing, autoscaling, and resource management — the practical knowledge interviewers expect from AI infrastructure candidates.

Q1: How does Kubernetes handle GPU scheduling? What are the limitations?


Answer: Kubernetes schedules GPUs as extended resources using the NVIDIA device plugin. GPUs are advertised as nvidia.com/gpu: N on each node, and pods request them in their resource spec.

How it works:

  1. The NVIDIA GPU device plugin runs as a DaemonSet on each GPU node
  2. It discovers GPUs via NVML (the same library nvidia-smi uses) and reports them to the kubelet
  3. The scheduler treats GPUs as integer resources: a pod requesting 2 GPUs gets exactly 2
  4. GPUs are allocated exclusively — no sharing between pods by default
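The request in step 3 looks like this in a pod spec (a minimal sketch; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training                           # hypothetical name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled image
    resources:
      limits:
        nvidia.com/gpu: 2                     # whole GPUs only; both allocated exclusively
```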

Limitations:

  • No fractional GPUs: You cannot request 0.5 GPUs. A small inference model that uses 2 GB of an 80 GB A100 still monopolizes the entire GPU. Solutions: MIG, GPU time-slicing, or virtual GPU (vGPU).
  • No topology awareness: The default scheduler does not know which GPUs are connected via NVLink vs PCIe. It may schedule a 4-GPU training job across GPUs on different NUMA nodes or PCIe switches, killing inter-GPU bandwidth. Solution: Topology-aware scheduling with the GPU Operator or custom scheduler plugins.
  • No gang scheduling: Kubernetes schedules pods independently. A distributed training job needing 8 GPUs across 2 nodes might get 4 GPUs on node 1 but wait indefinitely for 4 GPUs on node 2 (resource fragmentation). Solutions: Volcano, Kueue, or custom scheduling.
  • No GPU memory awareness: The scheduler cannot distinguish between GPU models or available GPU memory. A pod requesting 1 GPU might get an A100 when it only needs a T4. Solution: node labels and node selectors/affinity.
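As a sketch of that last workaround, assuming GPU Feature Discovery labels the nodes, a nodeSelector can pin a pod to a specific GPU model:

```yaml
# Pin a small inference pod to T4 nodes via a GPU Feature Discovery label
# (the exact label value depends on how GFD reports the product name)
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4
```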

Q2: What is the NVIDIA GPU Operator and why would you use it?


Answer: The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes. Without it, you must manually install and maintain NVIDIA drivers, container runtime, device plugin, DCGM, and MIG manager on every GPU node.

Components managed by the GPU Operator:

  • NVIDIA Driver: Installs and manages GPU drivers as a container, enabling driver updates without node reboots
  • NVIDIA Container Toolkit: Enables containers to access GPUs via the nvidia container runtime
  • NVIDIA Device Plugin: Exposes GPUs to Kubernetes scheduler
  • DCGM Exporter: Exports GPU metrics to Prometheus for monitoring
  • MIG Manager: Configures Multi-Instance GPU partitioning
  • GPU Feature Discovery: Labels nodes with GPU attributes (model, driver version, CUDA version, MIG capability)
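For illustration, GPU Feature Discovery applies node labels along these lines (values are illustrative and depend on the hardware and driver):

```yaml
# Example node labels set by GPU Feature Discovery
metadata:
  labels:
    nvidia.com/gpu.product: A100-SXM4-80GB
    nvidia.com/gpu.count: "8"
    nvidia.com/gpu.memory: "81920"
    nvidia.com/mig.capable: "true"
```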

Why it matters: In a large cluster with hundreds of GPU nodes, manually managing driver versions and toolkit compatibility across nodes is operationally painful. The GPU Operator treats the entire NVIDIA stack as a Kubernetes-native workload, enabling GitOps-style management and consistent GPU environments across the cluster.

Q3: Compare Volcano and Kueue for ML job scheduling on Kubernetes.

💡

Answer: Both solve the gang scheduling and job queuing problem for ML workloads, but with different philosophies:

| Feature | Volcano | Kueue |
|---|---|---|
| Approach | Full scheduler replacement (custom scheduler) | Admission controller (works with the default scheduler) |
| Gang scheduling | Native: all pods in a job are scheduled together or not at all | Supported via admission gating: pods are held until all resources are available |
| Fair sharing | Proportional fair scheduling with DRF (Dominant Resource Fairness) | Resource quotas with ClusterQueues and cohorts for borrowing |
| Preemption | Priority-based preemption within the scheduler | Priority-based preemption with borrowing and reclaim policies |
| Job types | Custom VolcanoJob CRD; also supports K8s Jobs, Spark, MPI | Works with native K8s Jobs, Kubeflow training operators, Ray |
| Maturity | CNCF Incubating; used at Huawei, Baidu, and several large enterprises | Kubernetes SIG Scheduling project; used at Google, adopted by GKE |
| Complexity | Higher: replaces the scheduler, manages its own queue | Lower: admission controller pattern, simpler architecture |

Recommendation: Use Kueue for most Kubernetes-native ML platforms — it is simpler, integrates with the default scheduler, and is actively developed by the Kubernetes community. Use Volcano when you need advanced scheduling policies (bin-packing, task-level scheduling within a job) or have an existing Volcano deployment.
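With Kueue, submitting to a queue is a one-label change on an ordinary Job (queue and image names are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # a LocalQueue in this namespace
spec:
  suspend: true              # Kueue admits (unsuspends) the Job when quota is free
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: my-trainer:latest               # illustrative
        resources:
          limits:
            nvidia.com/gpu: 2
```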

Q4: How do you implement autoscaling for ML workloads on Kubernetes?


Answer: ML workloads require different autoscaling strategies than web services:

Training workloads (batch jobs):

  • Typically not autoscaled — training jobs request a fixed number of GPUs for their duration
  • Cluster Autoscaler: Scales the node pool up when pending pods need GPUs, scales down when nodes are idle. Configure with appropriate scale-down delay (GPU nodes are expensive, but provisioning takes 5–10 minutes)
  • Karpenter (AWS) / NAP (GKE): More responsive node provisioning. Can select the optimal GPU instance type based on pending pod requirements

Inference workloads (serving):

  • HPA (Horizontal Pod Autoscaler): Scale inference pods based on GPU utilization, request rate, or queue depth. Challenge: GPU utilization metrics require DCGM and custom metrics adapter.
  • KEDA (Kubernetes Event-Driven Autoscaling): Scale based on external metrics like message queue depth (Kafka/SQS), HTTP request rate, or custom Prometheus queries
  • VPA (Vertical Pod Autoscaler): Not recommended for GPU workloads: VPA adjusts CPU and memory requests and does not manage extended resources such as nvidia.com/gpu, so it cannot resize GPU allocations
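A sketch of the HPA approach, assuming DCGM metrics are already exposed through a custom metrics adapter (the metric and Deployment names depend on your setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference              # hypothetical inference Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL   # DCGM GPU utilization, via metrics adapter
      target:
        type: AverageValue
        averageValue: "70"           # target ~70% average GPU utilization
```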

Key considerations:

  • GPU node startup time: 5–10 minutes for cloud instances. Keep warm pools of pre-provisioned nodes for latency-sensitive scaling.
  • Model loading time: Loading a large model into GPU memory takes 30–120 seconds. Factor this into scaling response time.
  • Scale-to-zero: Possible for batch inference but challenging for real-time serving due to cold start latency. Use Knative with GPU support for serverless inference with scale-to-zero.

Q5: How do you manage resource quotas and fair sharing for GPU workloads across teams?


Answer: In a shared GPU cluster, resource management is critical to prevent one team from monopolizing expensive GPUs:

Kubernetes ResourceQuotas:

  • Set per-namespace GPU limits: requests.nvidia.com/gpu: "16" caps the team at 16 GPUs (extended resources are quota-ed via the requests. prefix)
  • Limitation: quotas are hard limits. If Team A has 16 GPU quota but is only using 4, the other 12 sit idle — they cannot be borrowed by Team B
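The hard-limit quota described above, as a minimal sketch (namespace name hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                 # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "16"   # hard cap; unused GPUs cannot be borrowed
```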

Kueue ClusterQueues (recommended):

  • Guaranteed quota: Each team gets a minimum GPU allocation (e.g., Team A: 16, Team B: 8)
  • Borrowing: When a team is under quota, other teams can borrow their unused GPUs
  • Preemption: When the owning team needs their GPUs back, borrowed workloads are preempted (lower-priority jobs killed first)
  • Cohorts: Group ClusterQueues into cohorts that can share resources. Research and production cohorts with different policies.
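A minimal ClusterQueue sketch with guaranteed quota and borrowing (names and numbers are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: ml-teams             # queues in the same cohort can lend and borrow
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor     # must reference an existing ResourceFlavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16       # guaranteed allocation
        borrowingLimit: 8      # may borrow up to 8 more from the cohort
```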

Priority classes:

  • Define priority tiers: production-critical (highest), time-sensitive-training, best-effort-experiment (lowest)
  • Higher priority workloads preempt lower priority ones when resources are scarce
  • Best-effort jobs run on idle capacity and get preempted immediately when higher-priority jobs arrive

Monitoring and chargeback: Use DCGM metrics + labels to track per-team GPU utilization. Build dashboards showing allocation vs actual usage. Teams with consistently low utilization get their quotas reduced.

Q6: How do you run distributed training jobs on Kubernetes?


Answer: Distributed training on Kubernetes requires coordination between multiple pods. Three main approaches:

1. Kubeflow Training Operator (recommended):

  • Custom CRDs for different frameworks: PyTorchJob, TFJob, MPIJob
  • PyTorchJob: creates N worker pods, sets environment variables for torch.distributed (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) automatically
  • Handles pod lifecycle: if one worker fails, the entire job is marked as failed and can be retried
  • Example: PyTorchJob with 4 workers, each requesting 8 GPUs = 32 GPU distributed training job
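The 32-GPU example above as a PyTorchJob sketch (image illustrative; the operator expects the container to be named pytorch):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ddp-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch              # required container name
            image: my-trainer:latest   # illustrative
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 3                      # 1 master + 3 workers = 4 pods x 8 GPUs = 32
      template:
        spec:
          containers:
          - name: pytorch
            image: my-trainer:latest
            resources:
              limits:
                nvidia.com/gpu: 8
```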

2. MPI Operator:

  • Uses MPI (mpirun/mpiexec) to launch distributed jobs. Better for Horovod-based training.
  • Creates a launcher pod and N worker pods with SSH access between them
  • More complex setup but familiar to HPC users

3. Manual setup with StatefulSets or Jobs:

  • Create a headless service for DNS-based discovery
  • Use StatefulSet for stable pod names (worker-0, worker-1, etc.)
  • Set MASTER_ADDR to worker-0's DNS name, configure WORLD_SIZE and RANK via pod ordinal index
  • More control but more operational burden

Critical requirements for distributed training on K8s:

  • Host networking: Use hostNetwork: true for NCCL performance — avoids container networking overhead for GPU-to-GPU communication
  • Shared memory: Mount /dev/shm as a memory-backed volume with sufficient size (default 64MB is too small for PyTorch data loaders)
  • IPC_LOCK capability: Required for RDMA/InfiniBand support
  • NCCL environment: Set NCCL_SOCKET_IFNAME to the correct network interface
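The four requirements above map onto a pod spec roughly like this (interface name and sizes are placeholders for your environment):

```yaml
spec:
  hostNetwork: true                  # bypass the CNI overlay for NCCL traffic
  containers:
  - name: trainer
    image: my-trainer:latest         # illustrative
    env:
    - name: NCCL_SOCKET_IFNAME
      value: eth0                    # set to the node's high-speed interface
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]            # needed for RDMA memory pinning
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm            # enlarged shared memory for data loaders
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi                # size to fit the data-loader workers
```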

Q7: What is GPU time-slicing and how does it differ from MIG?


Answer: Both allow multiple workloads to share a single GPU, but with fundamentally different approaches:

GPU Time-Slicing:

  • The NVIDIA device plugin advertises more "virtual GPUs" than physical GPUs. Example: configure 4 replicas per GPU, so an 8-GPU node advertises 32 nvidia.com/gpu.
  • Workloads time-share the GPU through CUDA's built-in context switching. Each pod gets a turn to run kernels.
  • No memory isolation: All pods share the full GPU memory. One pod can OOM-kill others.
  • No compute isolation: A heavy workload on one slice affects performance of all others.
  • Best for: Development environments, Jupyter notebooks, small inference models where utilization is low and intermittent.
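The 4-replicas-per-GPU configuration above can be expressed in the device plugin's sharing config (a sketch; how the ConfigMap is wired to the plugin depends on your device plugin or GPU Operator deployment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4     # each physical GPU is advertised as 4 schedulable GPUs
```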

MIG (Multi-Instance GPU):

  • Hardware-level partitioning: each instance gets dedicated SMs, memory, and memory bandwidth.
  • Full isolation: one instance cannot affect another's performance or memory.
  • Only available on MIG-capable data-center GPUs (A100, A30, H100 and newer). Limited partition profiles (up to 7 instances per GPU).
  • Best for: Multi-tenant inference, production workloads requiring guaranteed performance.

Interview insight: Time-slicing is a quick win for improving GPU utilization in development clusters (from 10% to 50%+), but never use it for production inference where latency SLAs matter. MIG is the production solution for GPU sharing.

Q8: How do you handle node failures and pod evictions for long-running training jobs?


Answer: Long-running training jobs (hours to weeks) are vulnerable to many failure modes on Kubernetes:

Failure modes:

  • Node failure: Hardware crash, kernel panic, GPU hardware error. Pod is lost.
  • Pod eviction: Node pressure (memory, disk), preemption by higher-priority pod, or cluster autoscaler scaling down.
  • GPU error: Uncorrectable ECC error, GPU fallen off bus (XID 79). Requires GPU replacement.
  • Network partition: InfiniBand link failure between nodes. NCCL timeout kills all workers.

Mitigation strategies:

  • Checkpointing: Save model state to persistent storage (PVC or S3/GCS) every 15–30 minutes. Use async checkpointing to minimize training interruption. Test checkpoint restore before relying on it.
  • Pod Disruption Budgets (PDB): Set minAvailable to prevent voluntary evictions from disrupting training jobs. Protects against cluster autoscaler and node drain operations.
  • Priority classes: Assign high priority to training jobs so they are not preempted by lower-priority workloads.
  • TorchElastic / Elastic Horovod: Automatically restart failed workers and resume from checkpoint. Supports dynamic scaling without restarting the entire job.
  • Job retry with backoff: Configure Kubernetes Job with backoffLimit and restart policy. On failure, the job restarts and resumes from the latest checkpoint.
  • Health monitoring: Use liveness probes that check GPU health (nvidia-smi, DCGM) and training progress (step counter). Kill and restart stuck pods proactively.
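The PDB mitigation above, as a sketch (the label selector is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: training-pdb
spec:
  minAvailable: "100%"     # block voluntary disruptions (drains, autoscaler scale-down)
  selector:
    matchLabels:
      app: ddp-train       # hypothetical label on the training pods
```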

Q9: What is topology-aware scheduling and why does it matter for ML?


Answer: Topology-aware scheduling places pods on GPUs that have optimal interconnect topology for the workload. This is critical because GPU communication performance varies dramatically based on physical connectivity.

Why it matters:

  • Two GPUs connected via NVLink: 900 GB/s bandwidth, ~1 microsecond latency
  • Two GPUs connected via PCIe switch: 32 GB/s bandwidth, ~5 microsecond latency
  • Two GPUs on different NUMA nodes: 16 GB/s effective bandwidth, ~10 microsecond latency
  • A 4-GPU tensor parallel training job placed on NVLink-connected GPUs runs 3–5x faster than the same job on PCIe-connected GPUs

Implementation:

  • GPU Feature Discovery + Node Labels: Label nodes with GPU topology information (NVLink connectivity, NUMA affinity). Use node affinity rules to place workloads on nodes with optimal topology.
  • Topology Aware Scheduler Plugin: NVIDIA's custom scheduler extension that understands GPU-to-GPU connectivity within a node and assigns GPUs that are NVLink-connected to the same pod.
  • NUMA-aware scheduling: Ensure GPU, CPU, and memory are on the same NUMA node. Crossing NUMA boundaries for data loading adds latency and reduces PCIe throughput.

In practice: For DGX-style nodes with NVSwitch (all GPUs connected), topology is less critical within a node. For heterogeneous servers with mixed NVLink/PCIe connectivity, topology-aware scheduling can make or break training performance.

Q10: How do you manage Kubernetes networking for distributed training?

💡

Answer: Distributed training has unique networking requirements that differ from typical Kubernetes services:

Host networking vs pod networking:

  • Standard Kubernetes: pods get virtual IPs via CNI (Calico, Flannel, Cilium). Fine for microservices but adds overhead for high-throughput GPU communication.
  • For distributed training: use hostNetwork: true so pods bypass the overlay network and directly access host network interfaces, including InfiniBand adapters.
  • Trade-off: host networking means pod port conflicts and loss of network isolation. Acceptable for training jobs, not for multi-tenant services.

InfiniBand / RDMA on Kubernetes:

  • Install the RDMA device plugin (or Mellanox OFED device plugin) to expose InfiniBand adapters as schedulable resources
  • Pods request InfiniBand: rdma/hca: 1 in resource spec
  • Requires IPC_LOCK capability and SYS_RESOURCE for memory locking (RDMA requires pinned memory)
  • Configure NCCL to use the InfiniBand interface: NCCL_IB_HCA=mlx5
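Combining those points in a pod sketch (the rdma/hca resource name depends on how the RDMA device plugin is configured):

```yaml
spec:
  containers:
  - name: trainer
    image: my-trainer:latest       # illustrative
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_RESOURCE"]   # allow pinning memory for RDMA
    env:
    - name: NCCL_IB_HCA
      value: mlx5                  # prefix-match the Mellanox HCA devices
    resources:
      limits:
        nvidia.com/gpu: 8
        rdma/hca: 1                # InfiniBand adapter exposed by the RDMA plugin
```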

Multi-NIC / Secondary networks:

  • Use Multus CNI to attach secondary network interfaces (InfiniBand, high-speed Ethernet) to pods alongside the primary Kubernetes network
  • Training traffic goes over the high-speed secondary network; Kubernetes management traffic (health checks, logs) uses the primary network
  • This separation prevents training traffic from being impacted by cluster management overhead

Network policies: Do not apply restrictive network policies to distributed training pods — they need to communicate freely on all ports for NCCL. Isolate training namespaces at the network level instead.