ML Manifests (Intermediate)
Well-structured Kubernetes manifests are the foundation of GitOps for ML. This lesson covers how to design declarative configurations for ML training jobs, model serving endpoints, feature pipelines, and GPU resource allocation so that ArgoCD or Flux can sync them from Git without manual steps.
Training Job Manifests
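ML training jobs are typically represented as Kubernetes Jobs or as custom resources such as PyTorchJob or TFJob. A GitOps-friendly training manifest usually declares resource requests and limits, tolerations for GPU nodes, and volume mounts for datasets. The following excerpt shows the ownership labels, the GPU and memory limits, and the dataset mount: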
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: recommendation-model-v3
  labels:
    app.kubernetes.io/managed-by: argocd
    ml.platform/team: recommendations
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.internal/ml/rec-model:v3.2.1
              resources:
                limits:
                  nvidia.com/gpu: 4
                  memory: 64Gi
              volumeMounts:
                - name: training-data
                  mountPath: /data
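The excerpt above shows the GPU limits and the dataset mount, but not the tolerations or the volume behind that mount. A minimal sketch of how the Master pod template might be completed follows. The taint key `nvidia.com/gpu`, the CPU request, and the PVC name `rec-training-data` are assumptions for illustration; the real values depend on how your GPU nodes are tainted and where the dataset lives.

```yaml
# Fragment: goes under spec.pytorchReplicaSpecs.Master.template
spec:
  tolerations:
    # Assumed taint on GPU nodes; match whatever key your cluster uses
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: pytorch
      image: registry.internal/ml/rec-model:v3.2.1
      resources:
        requests:
          cpu: "8"          # assumed value
          memory: 64Gi
        limits:
          nvidia.com/gpu: 4 # GPU requests default to the limit
          memory: 64Gi
      volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
  volumes:
    # Backs the training-data mount; the claim name is hypothetical
    - name: training-data
      persistentVolumeClaim:
        claimName: rec-training-data
        readOnly: true
```

Mounting the dataset read-only is a design choice here: it keeps a training run from modifying shared data that other jobs may depend on.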
Model Serving Manifests
Model serving can use a KServe InferenceService or a standard Kubernetes Deployment with autoscaling. The key considerations are readiness probes, resource limits, autoscaling signals, and canary rollouts:
- Readiness probes — Ensure the model is fully loaded before receiving traffic
- Resource limits — Specify the GPU count, memory, and CPU each inference replica needs
- HPA configuration — Scale based on request latency or queue depth, not just CPU
- Canary rollouts — Gradually shift traffic to new model versions
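As a rough illustration of the first three points, here is a sketch of a plain Deployment with a readiness probe and a GPU limit, paired with an `autoscaling/v2` HorizontalPodAutoscaler that scales on a per-pod custom metric. The names `rec-model-serving`, the serving image, the `/healthz` path, port 8080, and the metric `inference_queue_depth` are all hypothetical. The Pods metric also assumes a custom metrics adapter (for example, one backed by Prometheus) is installed in the cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rec-model-serving
  labels:
    app: rec-model-serving
spec:
  # replicas is omitted so the HPA owns the replica count
  # and the GitOps controller does not fight it on every sync
  selector:
    matchLabels:
      app: rec-model-serving
  template:
    metadata:
      labels:
        app: rec-model-serving
    spec:
      containers:
        - name: server
          image: registry.internal/ml/rec-model-serving:v3.2.1  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            # The pod receives traffic only after this endpoint succeeds,
            # so the server should report healthy once the model is loaded
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 6
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
            limits:
              nvidia.com/gpu: 1
              memory: 8Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rec-model-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rec-model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"            # target average queue depth per pod
```

If you use KServe instead, the canary split is expressed on the InferenceService itself through the predictor's `canaryTrafficPercent` field, so the traffic percentage becomes another value reviewed and versioned in Git.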
Repository Structure for ML Manifests
Organize your Git repository to separate base configurations from environment-specific overlays:
ml-infrastructure/
  base/
    training/          # Base training job templates
    serving/           # Base model serving configs
    monitoring/        # Base monitoring stack
  overlays/
    dev/               # Dev environment patches
    staging/           # Staging environment patches
    production/        # Production environment patches
  models/
    recommendation/    # Model-specific configs
    nlp-classifier/    # Model-specific configs
  kustomization.yaml
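To make the base and overlay split concrete, here is a sketch of what a Kustomize file at `overlays/production/kustomization.yaml` could look like. It assumes that each directory under `base/` has its own `kustomization.yaml`, that the base training directory contains the `recommendation-model-v3` PyTorchJob, and that production runs in a namespace called `ml-production`. None of these details are stated in the layout above.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Assumed namespace for the production environment
namespace: ml-production

# Pull in the shared base definitions
resources:
  - ../../base/training
  - ../../base/serving

# Production-only change: raise the memory limit of the training job
patches:
  - target:
      group: kubeflow.org
      version: v1
      kind: PyTorchJob
      name: recommendation-model-v3
    patch: |-
      - op: replace
        path: /spec/pytorchReplicaSpecs/Master/template/spec/containers/0/resources/limits/memory
        value: 128Gi
```

A JSON patch is used here rather than a strategic merge patch because Kustomize has no merge-key information for custom resources, so a merge patch would replace the whole `containers` list instead of changing one field.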
ConfigMaps and Secrets for ML
ML workloads frequently need configuration for hyperparameters, model registry endpoints, and credentials. Use ConfigMaps for non-sensitive configuration and Sealed Secrets or SOPS for encrypted secrets in Git:
- Hyperparameters — Store as ConfigMaps; version changes are tracked in Git
- Model registry URLs — ConfigMap referencing the artifact store endpoint
- Cloud credentials — Sealed Secrets or the External Secrets Operator for S3 or GCS access
- API keys — Never store plain-text secrets in Git; encrypt them before committing (Sealed Secrets, SOPS) or reference them from an external secret store
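The split between plain configuration and secrets might look like the following sketch: a ConfigMap for hyperparameters and the registry endpoint, and an ExternalSecret that tells the External Secrets Operator to materialize S3 credentials from an external store. All names, values, the registry URL, and the remote key `ml/training/s3` are hypothetical. The ExternalSecret `apiVersion` depends on the operator version you have installed, and a ClusterSecretStore named `cloud-secret-store` is assumed to exist already.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rec-model-hyperparameters
data:
  # ConfigMap values must be strings, so numbers are quoted
  learning_rate: "0.001"
  batch_size: "256"
  epochs: "20"
  model_registry_url: "https://registry.example.internal/models"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: training-s3-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: cloud-secret-store      # assumed to be configured separately
  target:
    name: training-s3-credentials # name of the Kubernetes Secret to create
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: ml/training/s3
        property: access_key_id
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: ml/training/s3
        property: secret_access_key
```

Only the reference to the secret is committed; the credential values stay in the external store, so nothing sensitive appears in Git history.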
Ready to Learn Drift Detection?
The next lesson covers detecting and remediating configuration drift in ML infrastructure, a critical capability for maintaining system reliability.