ArgoCD for ML Infrastructure (Intermediate)
ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. In this lesson, you will learn how to install ArgoCD, configure it for ML workloads, define Application and ApplicationSet resources, set up sync policies and health checks for training jobs and model servers, and manage multi-cluster ML deployments.
Installing ArgoCD
ArgoCD runs as a set of controllers inside your Kubernetes cluster. Install it from the upstream manifest (shown here) or from the Helm chart. The last command requires the argocd CLI:
# Create namespace and install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Access the ArgoCD API server
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get the initial admin password
argocd admin initial-password -n argocd
Defining an ML Application
An ArgoCD Application resource connects a path in a Git repository to a destination cluster and namespace. For ML workloads, a common pattern is to define separate applications for training infrastructure, serving infrastructure, and monitoring, so that each can sync independently:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-model-serving
  namespace: argocd
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/org/ml-infra.git
    targetRevision: main
    path: environments/production/serving
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
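The Application above references a project named ml-platform, which has to exist as an AppProject before the Application can sync. A minimal sketch of such a project follows. The description text and the decision to whitelist the Namespace kind are assumptions for illustration, not part of the original setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ml-platform
  namespace: argocd
spec:
  description: ML platform workloads   # illustrative description (assumption)
  # Only allow manifests from the ML infrastructure repository
  sourceRepos:
    - https://github.com/org/ml-infra.git
  # Only allow deployments into the serving namespace of the local cluster
  destinations:
    - server: https://kubernetes.default.svc
      namespace: ml-serving
  # Assumption: permit the cluster-scoped Namespace kind so that Namespace
  # manifests committed to the repository can be synced
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
```

Restricting sourceRepos and destinations at the project level keeps an Application in this project from deploying from, or into, anything outside the listed repository and namespace.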
ApplicationSets for Multi-Environment ML
ApplicationSets allow you to template applications across multiple environments (dev, staging, production) or multiple ML models from a single definition:
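apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ml-models
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          # Each element's keys become template parameters; gpu is available
          # as '{{gpu}}' but is not used in this template
          - model: recommendation-v2
            env: production
            gpu: "4"
          - model: nlp-classifier
            env: production
            gpu: "2"
  template:
    metadata:
      name: '{{model}}-{{env}}'
    spec:
      # project, repoURL, targetRevision, and destination are assumed to
      # match the Application example above; adjust them for your setup
      project: ml-platform
      source:
        repoURL: https://github.com/org/ml-infra.git
        targetRevision: main
        path: 'models/{{model}}/{{env}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: ml-serving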
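For multi-cluster ML deployments, the same templating idea can be driven by the cluster generator, which produces one Application per cluster registered with ArgoCD. The sketch below is illustrative only: the ApplicationSet name and the workload: ml-serving label are hypothetical, and the repository, project, and path are assumed to match the earlier Application example:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ml-serving-clusters        # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters:
        # Select only registered clusters carrying this label
        # (hypothetical label set on the cluster Secret)
        selector:
          matchLabels:
            workload: ml-serving
  template:
    metadata:
      # {{name}} and {{server}} are supplied by the cluster generator
      name: 'ml-serving-{{name}}'
    spec:
      project: ml-platform
      source:
        repoURL: https://github.com/org/ml-infra.git
        targetRevision: main
        path: environments/production/serving
      destination:
        server: '{{server}}'
        namespace: ml-serving
```

With this approach, registering a new cluster with the matching label is enough for ArgoCD to generate its serving Application. The project used here would also need to list each target cluster among its allowed destinations.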
Custom Health Checks for ML Resources
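ArgoCD ships built-in health checks for core Kubernetes resources such as Deployments, but it needs custom health checks to understand the lifecycle of ML-specific custom resources like training jobs and inference services. Custom checks are Lua scripts registered in the argocd-cm ConfigMap: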
- Training Jobs — Healthy when the job completes successfully, degraded when pods are in CrashLoopBackOff
- Model Servers — Healthy when readiness probes pass and the model is loaded into memory
- Feature Pipelines — Healthy when the latest pipeline run completed within the expected SLA
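As a sketch of the training job case, a custom health check can be added to argocd-cm as a Lua script keyed by the resource group and kind. This example assumes training jobs are Kubeflow PyTorchJob resources (group kubeflow.org) that report Succeeded and Failed conditions in their status. Both the choice of CRD and the status messages are assumptions for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  resource.customizations.health.kubeflow.org_PyTorchJob: |
    hs = {}
    -- Default: the job has been created but has not finished yet
    hs.status = "Progressing"
    hs.message = "Training job is pending or running"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Failed" and condition.status == "True" then
          hs.status = "Degraded"
          hs.message = condition.message or "Training job failed"
          return hs
        end
        if condition.type == "Succeeded" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message or "Training job completed"
          return hs
        end
      end
    end
    return hs
```

In an existing installation, add this key to the argocd-cm ConfigMap that is already there instead of replacing its other settings. The same pattern, with a different group, kind, and condition logic, would apply to model servers and feature pipelines.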
Sync Waves for ML Deployments
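ML deployments often have ordering dependencies. Use ArgoCD sync waves, set with the argocd.argoproj.io/sync-wave annotation, to ensure resources are created in the correct order. ArgoCD applies lower-numbered waves first and waits for their resources to become healthy before starting the next wave: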
- Wave 0: Namespaces, ConfigMaps, Secrets
- Wave 1: PersistentVolumeClaims for model artifacts and datasets
- Wave 2: Feature store and data pipeline deployments
- Wave 3: Model serving deployments (depends on feature store readiness)
- Wave 4: Monitoring and alerting resources
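The wave assignments above are expressed with the argocd.argoproj.io/sync-wave annotation on each resource. The following sketch shows a wave 1 PersistentVolumeClaim and a wave 3 model serving Deployment that mounts it. The resource names, storage size, and container image are hypothetical placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-artifacts            # hypothetical name
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # value must be a quoted string
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                # hypothetical size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server               # hypothetical name
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "3"   # applied after waves 0 to 2 are healthy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry.example.com/model-server:1.0.0   # hypothetical image
          volumeMounts:
            - name: artifacts
              mountPath: /models
      volumes:
        - name: artifacts
          persistentVolumeClaim:
            claimName: model-artifacts
```

Resources without the annotation default to wave 0. One caveat to check in your own cluster: a PVC whose StorageClass binds only when a pod first mounts it may stay Pending, which can hold up later waves.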
Ready to Learn Flux?
The next lesson covers Flux CD as an alternative GitOps controller, with its composable architecture and image automation features.