Advanced Best Practices
Running ML workloads on Kubernetes in production requires attention to scheduling efficiency, security hardening, observability, and operational workflows. This lesson provides proven best practices for enterprise ML-on-K8s deployments.
Scheduling Best Practices
- Use Kueue or Volcano: Don't rely on the default scheduler for GPU job queuing. Use a dedicated batch scheduler.
- Node affinity and taints: Use taints on GPU nodes to prevent non-GPU workloads from being scheduled there.
- Topology-aware scheduling: For multi-GPU training, use topology constraints to place pods on nodes with NVLink connectivity.
- Gang scheduling: Ensure all workers of a distributed training job start together or not at all.
- Preemption policies: Allow production inference to preempt batch training when resources are scarce.
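Taints and tolerations are the standard mechanism for the second point above. A minimal sketch (node name, image, and GPU product value are illustrative; the `nvidia.com/gpu.product` label assumes NVIDIA GPU Feature Discovery is installed):

```yaml
# Taint GPU nodes so only pods that tolerate the taint can land there:
#   kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: train-worker
spec:
  tolerations:
    - key: nvidia.com/gpu        # matches the taint applied above
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product   # pin to a specific GPU type
                operator: In
                values: ["NVIDIA-A100-SXM4-80GB"]
  containers:
    - name: trainer
      image: my-registry/trainer:latest      # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```

The toleration alone only permits scheduling onto tainted nodes; the node affinity is what actually steers the pod toward them.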
Security
- Pod Security Standards: Enforce restricted pod security standards. Run containers as non-root.
- Network policies: Isolate ML namespaces. Only allow required cross-namespace traffic.
- Image scanning: Scan all container images for vulnerabilities before deployment.
- Secrets management: Use external secret stores (Vault, cloud KMS) instead of K8s Secrets for sensitive data.
- RBAC: Grant least-privilege access. Data scientists should not have cluster-admin.
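Namespace isolation from the network-policy point above can start with a default-deny policy that still allows DNS. A sketch, assuming an illustrative `ml-team-a` namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ml-team-a          # illustrative namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  egress:
    # Deny all egress except DNS lookups to kube-system.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

Required cross-namespace traffic is then allowed by additional, narrowly scoped policies on top of this baseline.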
Monitoring and Observability
- DCGM Exporter: Deploy NVIDIA DCGM Exporter to expose GPU metrics to Prometheus.
- Custom dashboards: Build Grafana dashboards for GPU utilization, job queue depth, and training metrics.
- Alerting: Alert on GPU underutilization (<30%), job failures, and node health issues.
- Logging: Centralize training job logs with Loki or Elasticsearch for debugging.
- Cost tracking: Use Kubecost or cloud-native tools to track cost per namespace and workload.
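The underutilization alert above can be expressed against the DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL` metric. A sketch using the Prometheus Operator's PrometheusRule CRD (label names follow DCGM Exporter's defaults, but verify them against your deployment):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUUnderutilized
          # Average utilization below 30% over a 30-minute window.
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 30
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 30% utilization"
```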
GitOps for ML Infrastructure
- ArgoCD or Flux: Manage cluster configuration, namespaces, quotas, and operators via GitOps.
- Version control: Store all K8s manifests in git with pull request reviews.
- Environment promotion: Promote configurations from dev to staging to prod via git branches.
- Drift detection: ArgoCD detects and alerts when cluster state diverges from git.
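An ArgoCD Application ties a git path to a cluster destination and handles both sync and drift. A minimal sketch (repo URL and paths are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-platform-config  # illustrative repo
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc   # the local cluster
    namespace: ml-platform
  syncPolicy:
    automated:
      prune: true      # delete resources removed from git
      selfHeal: true   # revert manual changes (drift) back to git state
```

With `selfHeal` disabled, ArgoCD only reports drift; enabling it makes git the enforced source of truth.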
Operational Checklist
- Install NVIDIA GPU Operator for automated driver and plugin management
- Configure resource quotas for every team namespace
- Deploy Kueue for GPU job queuing and fair sharing
- Set up DCGM Exporter + Prometheus + Grafana for GPU monitoring
- Implement network policies for namespace isolation
- Use GitOps for all cluster configuration changes
- Configure cluster autoscaler with GPU node pool scale-to-zero
- Run regular security scans on container images and cluster configuration
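For the per-team quota item in the checklist, GPUs can be capped alongside CPU and memory with a standard ResourceQuota. A sketch for an illustrative namespace (note that extended resources like `nvidia.com/gpu` are quota-limited via the `requests.` prefix):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: ml-team-a        # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested at once
    requests.cpu: "64"
    requests.memory: 256Gi
```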
Key Takeaway: Kubernetes provides a powerful platform for ML workloads, but it requires deliberate configuration for GPU management, multi-tenancy, and observability. Invest in platform tooling early to avoid operational debt as your ML infrastructure grows.