ML Manifests (Intermediate)

Well-structured Kubernetes manifests are the foundation of GitOps for ML. This lesson covers how to design declarative configurations for ML training jobs, model serving endpoints, feature pipelines, and GPU resource allocation so that a GitOps controller such as Argo CD or Flux can reconcile them directly from Git.

Training Job Manifests

ML training jobs are typically represented as Kubernetes Jobs or as custom resources such as PyTorchJob or TFJob. A GitOps-friendly training manifest declares GPU and memory resources, tolerations for GPU nodes, and volume mounts for datasets:

YAML
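apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: recommendation-model-v3
  labels:
    app.kubernetes.io/managed-by: argocd
    ml.platform/team: recommendations
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          # Let the pod schedule onto tainted GPU nodes (taint key is an assumption)
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: pytorch
              image: registry.internal/ml/rec-model:v3.2.1
              resources:
                limits:
                  nvidia.com/gpu: 4
                  memory: 64Gi
              volumeMounts:
                - name: training-data
                  mountPath: /data
          # Every volumeMount needs a matching volume; the PVC name is an assumption
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data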
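
For single-node training that does not need the training operator, a plain Kubernetes Job can carry the same settings. The following is a minimal sketch rather than a drop-in manifest: the Job name, the container name, the toleration key, and the training-data PersistentVolumeClaim are assumptions made for illustration.

YAML
apiVersion: batch/v1
kind: Job
metadata:
  name: recommendation-model-v3-2-1   # hypothetical name, one Job per training run
  labels:
    ml.platform/team: recommendations
spec:
  backoffLimit: 2                     # retry a failed pod at most twice
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu         # assumed taint key on GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.internal/ml/rec-model:v3.2.1
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 64Gi
          volumeMounts:
            - name: training-data
              mountPath: /data
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data  # assumed PVC name

A Job's pod template is immutable once the Job exists, so a new training run is usually expressed in Git as a new Job name rather than as an edit to an existing one.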

Model Serving Manifests

Model serving can use a KServe InferenceService or a standard Kubernetes Deployment with autoscaling. The key considerations are readiness probes, resource limits, autoscaling signals, and canary rollouts:

  • Readiness probes — Ensure the model is fully loaded before receiving traffic
  • Resource limits — Declare how many GPUs, and how much CPU and memory, each inference replica needs
  • HPA configuration — Scale on request latency or queue depth exposed as custom metrics, not just CPU
  • Canary rollouts — Gradually shift traffic to new model versions
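
The considerations above can be sketched for both serving options. The first example assumes a plain Deployment scaled by a HorizontalPodAutoscaler. The image, the /ready endpoint, port 8080, and the inference_queue_depth metric are all hypothetical, and a Pods metric like this one only works if a custom metrics adapter (for example Prometheus Adapter) exposes it to the cluster.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rec-model-serving
spec:
  # replicas is left unset so the HPA, not Git, owns the replica count
  selector:
    matchLabels:
      app: rec-model-serving
  template:
    metadata:
      labels:
        app: rec-model-serving
    spec:
      containers:
        - name: server
          image: registry.internal/ml/rec-model-server:v3.2.1   # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready              # hypothetical endpoint that returns 200 once the model is loaded
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rec-model-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rec-model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "20"

With KServe, the traffic split is a field on the InferenceService itself. A hedged sketch, assuming KServe runs in its Knative-based serverless mode and the model sits at a hypothetical S3 URI:

YAML
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommendation-model
spec:
  predictor:
    canaryTrafficPercent: 10            # share of traffic sent to the newest revision
    minReplicas: 1
    maxReplicas: 5
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://example-models/recommendation/v3   # hypothetical location
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 16Gi

Raising canaryTrafficPercent in a follow-up commit shifts more traffic to the new revision, which keeps the rollout history in Git.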

Repository Structure for ML Manifests

Organize your Git repository to separate base configurations from environment-specific overlays:

Text
ml-infrastructure/
  base/
    training/          # Base training job templates
    serving/           # Base model serving configs
    monitoring/        # Base monitoring stack
  overlays/
    dev/               # Dev environment patches
    staging/           # Staging environment patches
    production/        # Production environment patches
  models/
    recommendation/    # Model-specific configs
    nlp-classifier/    # Model-specific configs
  kustomization.yaml
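
As a sketch of how an overlay and a GitOps controller fit this layout, the first file below is a hypothetical overlays/production/kustomization.yaml. It assumes that each base directory contains its own kustomization.yaml and that a patch file named gpu-limits.yaml exists next to it in the overlay; the namespace is also an assumption.

YAML
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-production                # assumed target namespace
resources:
  - ../../base/training
  - ../../base/serving
  - ../../base/monitoring
patches:
  - path: gpu-limits.yaml               # hypothetical production-only patch
    target:
      kind: PyTorchJob
      name: recommendation-model-v3

An Argo CD Application can then point at that overlay directory. The repository URL below is a placeholder, and the automated sync policy is one possible choice, not a requirement:

YAML
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-infrastructure-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ml/ml-infrastructure.git   # placeholder URL
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-production
  syncPolicy:
    automated:
      prune: true                       # delete resources that were removed from Git
      selfHeal: true                    # revert manual changes made in the cluster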

ConfigMaps and Secrets for ML

ML workloads frequently need configuration for hyperparameters, model registry endpoints, and credentials. Use ConfigMaps for non-sensitive configuration and Sealed Secrets or SOPS for encrypted secrets in Git:

  • Hyperparameters — Store as ConfigMaps; version changes are tracked in Git
  • Model registry URLs — ConfigMap referencing the artifact store endpoint
  • Cloud credentials — Sealed Secrets or the External Secrets Operator for S3 and GCS access
  • API keys — Never commit plain-text secrets to Git; encrypt them before they are committed, for example with Sealed Secrets or SOPS

Best Practice: Use Kustomize's configMapGenerator and secretGenerator to create ConfigMap and Secret names that carry a content-hash suffix. Because the name changes whenever the content changes, every workload that references it gets an updated pod template, which triggers a rolling restart and keeps running workloads consistent with the state in Git.
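
A minimal sketch of the generator approach follows. The names and values are hypothetical throughout: rec-model-hyperparams, the literals, the registry URL, and deployment.yaml are illustrative only.

YAML
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml                     # hypothetical workload that references the ConfigMap by its base name
configMapGenerator:
  - name: rec-model-hyperparams
    literals:
      - LEARNING_RATE=0.001
      - BATCH_SIZE=256
      - MODEL_REGISTRY_URL=https://registry.example.internal/models

Kustomize emits the ConfigMap with a hash suffix appended to its name and, for built-in kinds such as Deployment, rewrites references to rec-model-hyperparams in the listed resources to match.

For credentials, the object committed to Git is the encrypted resource. A Sealed Secrets sketch, with the ciphertext left as a placeholder because it is produced by the kubeseal CLI against the cluster's public key:

YAML
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: s3-credentials                  # hypothetical name
  namespace: ml-production              # assumed namespace
spec:
  encryptedData:
    AWS_ACCESS_KEY_ID: CIPHERTEXT_FROM_KUBESEAL        # placeholder, not real ciphertext
    AWS_SECRET_ACCESS_KEY: CIPHERTEXT_FROM_KUBESEAL    # placeholder, not real ciphertext
  template:
    metadata:
      name: s3-credentials
      namespace: ml-production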

Ready to Learn Drift Detection?

The next lesson covers detecting and remediating configuration drift in ML infrastructure, a critical capability for maintaining system reliability.
