Advanced Best Practices
Running ML workloads on Kubernetes in production requires attention to scheduling efficiency, security hardening, observability, and operational workflows. This lesson provides proven best practices for enterprise ML-on-K8s deployments.
Scheduling Best Practices
- Use Kueue or Volcano: Don't rely on the default scheduler for GPU job queuing. Use a dedicated batch scheduler.
- Node affinity and taints: Use taints on GPU nodes to prevent non-GPU workloads from being scheduled there.
- Topology-aware scheduling: For multi-GPU training, use topology constraints to place pods on nodes with NVLink connectivity.
- Gang scheduling: Ensure all workers of a distributed training job start together or not at all.
- Preemption policies: Allow production inference to preempt batch training when resources are scarce.
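Taints and tolerations are the standard mechanism for the second point above. A minimal sketch (node name, image, and GPU product value are illustrative; the `nvidia.com/gpu.product` label assumes NVIDIA GPU Feature Discovery is installed):

```yaml
# Taint GPU nodes so only pods that tolerate the taint can land there:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: train-worker
spec:
  tolerations:
    - key: nvidia.com/gpu        # matches the taint applied above
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product   # pin to a specific GPU type
                operator: In
                values: ["NVIDIA-A100-SXM4-80GB"]
  containers:
    - name: trainer
      image: my-registry/trainer:latest      # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The toleration alone only permits scheduling onto tainted nodes; the node affinity is what actually steers the pod toward them.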
Security
- Pod Security Standards: Enforce restricted pod security standards. Run containers as non-root.
- Network policies: Isolate ML namespaces. Only allow required cross-namespace traffic.
- Image scanning: Scan all container images for vulnerabilities before deployment.
- Secrets management: Use external secret stores (Vault, cloud KMS) instead of K8s Secrets for sensitive data.
- RBAC: Grant least-privilege access. Data scientists should not have cluster-admin.
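Namespace isolation from the network-policy point above can start with a default-deny policy that still allows DNS. A sketch, assuming an illustrative `ml-team-a` namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ml-team-a          # illustrative namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  egress:
    # Deny all egress except DNS lookups to kube-system.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

Required cross-namespace traffic is then allowed by additional, narrowly scoped policies on top of this baseline.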
Monitoring and Observability
- DCGM Exporter: Deploy NVIDIA DCGM Exporter to expose GPU metrics to Prometheus.
- Custom dashboards: Build Grafana dashboards for GPU utilization, job queue depth, and training metrics.
- Alerting: Alert on GPU underutilization (<30%), job failures, and node health issues.
- Logging: Centralize training job logs with Loki or Elasticsearch for debugging.
- Cost tracking: Use Kubecost or cloud-native tools to track cost per namespace and workload.
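The underutilization alert above can be expressed against the DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL` metric. A sketch using the Prometheus Operator's PrometheusRule CRD (label names follow DCGM Exporter's defaults, but verify them against your deployment):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          # Average utilization below 30% over a 30-minute window.
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 30% utilization"
```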
GitOps for ML Infrastructure
- ArgoCD or Flux: Manage cluster configuration, namespaces, quotas, and operators via GitOps.
- Version control: Store all K8s manifests in git with pull request reviews.
- Environment promotion: Promote configurations from dev to staging to prod via git branches.
- Drift detection: ArgoCD detects and alerts when cluster state diverges from git.
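An ArgoCD Application ties a git path to a cluster destination and handles both sync and drift. A minimal sketch (repo URL and paths are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-config  # illustrative repo
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc   # the local cluster
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual changes (drift) back to git state
```

With `selfHeal` disabled, ArgoCD only reports drift; enabling it makes git the enforced source of truth.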
Operational Checklist
- Install NVIDIA GPU Operator for automated driver and plugin management
- Configure resource quotas for every team namespace
- Deploy Kueue for GPU job queuing and fair sharing
- Set up DCGM Exporter + Prometheus + Grafana for GPU monitoring
- Implement network policies for namespace isolation
- Use GitOps for all cluster configuration changes
- Configure cluster autoscaler with GPU node pool scale-to-zero
- Run regular security scans on container images and cluster configuration
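For the per-team quota item in the checklist, GPUs can be capped alongside CPU and memory with a standard ResourceQuota. A sketch for an illustrative namespace (note that extended resources like `nvidia.com/gpu` are quota-limited via the `requests.` prefix):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: ml-team-a        # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested at once
    requests.cpu: "64"
    requests.memory: 256Gi
```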
Key Takeaway: Kubernetes provides a powerful platform for ML workloads, but it requires deliberate configuration for GPU management, multi-tenancy, and observability. Invest in platform tooling early to avoid operational debt as your ML infrastructure grows.