Advanced Best Practices

Running ML workloads on Kubernetes in production requires attention to scheduling efficiency, security hardening, observability, and operational workflows. This lesson provides proven best practices for enterprise ML-on-K8s deployments.

Scheduling Best Practices

  • Use Kueue or Volcano: Don't rely on the default scheduler for GPU job queuing. Use a dedicated batch scheduler.
  • Node affinity and taints: Use taints on GPU nodes to prevent non-GPU workloads from being scheduled there.
  • Topology-aware scheduling: For multi-GPU training, use topology constraints to place pods on nodes with NVLink connectivity.
  • Gang scheduling: Ensure all workers of a distributed training job start together or not at all.
  • Preemption policies: Allow production inference to preempt batch training when resources are scarce.
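As a concrete sketch of the taint-and-toleration pattern above: the node carries a taint (applied out of band, e.g. kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule), and only pods that tolerate it can land there. The node label used for NVLink topology below is illustrative, not a standard Kubernetes label; the nvidia.com/gpu resource name is the one exposed by the NVIDIA device plugin.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  # Tolerate the GPU-node taint so this pod may schedule there;
  # untainted workloads without this toleration are kept off GPU nodes.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    # Hypothetical label marking NVLink-connected GPU nodes.
    topology.example.com/gpu-interconnect: nvlink
  containers:
    - name: train
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 4   # request 4 GPUs on a single node
```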

Security

  • Pod Security Standards: Enforce restricted pod security standards. Run containers as non-root.
  • Network policies: Isolate ML namespaces. Only allow required cross-namespace traffic.
  • Image scanning: Scan all container images for vulnerabilities before deployment.
  • Secrets management: Use external secret stores (Vault, cloud KMS) instead of K8s Secrets for sensitive data.
  • RBAC: Grant least-privilege access. Data scientists should not have cluster-admin.
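Two of these controls can be expressed directly in manifests. A minimal sketch, assuming a team namespace named ml-team-a and a monitoring namespace labeled with the standard kubernetes.io/metadata.name label: enforce the restricted Pod Security Standard on the namespace, default-deny ingress, then allow only monitoring traffic back in.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a
  labels:
    # Enforce the "restricted" Pod Security Standard (non-root, etc.)
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ml-team-a
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes: ["Ingress"] # no ingress rules listed, so all ingress is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring
  namespace: ml-team-a
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
```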

Monitoring and Observability

  • DCGM Exporter: Deploy NVIDIA DCGM Exporter to expose GPU metrics to Prometheus.
  • Custom dashboards: Build Grafana dashboards for GPU utilization, job queue depth, and training metrics.
  • Alerting: Alert on GPU underutilization (<30%), job failures, and node health issues.
  • Logging: Centralize training job logs with Loki or Elasticsearch for debugging.
  • Cost tracking: Use Kubecost or cloud-native tools to track cost per namespace and workload.
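The underutilization alert above can be sketched as a PrometheusRule (the Prometheus Operator CRD), using DCGM Exporter's standard GPU-utilization gauge DCGM_FI_DEV_GPU_UTIL; the exact label set on your metrics may differ by exporter version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          # Fire when a GPU averages under 30% utilization for 30 minutes.
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU below 30% utilization for 30 minutes"
```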

GitOps for ML Infrastructure

  • ArgoCD or Flux: Manage cluster configuration, namespaces, quotas, and operators via GitOps.
  • Version control: Store all K8s manifests in git with pull request reviews.
  • Environment promotion: Promote configurations from dev to staging to prod via git branches.
  • Drift detection: ArgoCD detects and alerts when cluster state diverges from git.
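A minimal ArgoCD Application tying cluster configuration to git might look like the following; the repository URL and path are placeholders for your own repo layout.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform-config
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform.git  # placeholder repo
    targetRevision: main
    path: clusters/prod          # placeholder path to prod manifests
  destination:
    server: https://kubernetes.default.svc  # this cluster
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual changes, i.e. correct detected drift
```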

Operational Checklist

  • Install NVIDIA GPU Operator for automated driver and plugin management.
  • Configure resource quotas for every team namespace.
  • Deploy Kueue for GPU job queuing and fair sharing.
  • Set up DCGM Exporter + Prometheus + Grafana for GPU monitoring.
  • Implement network policies for namespace isolation.
  • Use GitOps for all cluster configuration changes.
  • Configure cluster autoscaler with GPU node pool scale-to-zero.
  • Run regular security scans on container images and cluster configuration.
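For the per-team quotas in the checklist, a sketch of a ResourceQuota capping GPUs alongside CPU and memory (namespace name and limits are illustrative; requests.nvidia.com/gpu is the standard quota key for the extended GPU resource):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: ml-team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # cap total GPUs requested in the namespace
    requests.cpu: "64"
    requests.memory: 256Gi
```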

Key Takeaway: Kubernetes provides a powerful platform for ML workloads, but it requires deliberate configuration for GPU management, multi-tenancy, and observability. Invest in platform tooling early to avoid operational debt as your ML infrastructure grows.