ArgoCD for ML Infrastructure (Intermediate)

ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. In this lesson, you will learn how to install ArgoCD, configure it for ML workloads, define Application and ApplicationSet resources, set up sync policies and health checks for training jobs and model servers, and manage multi-cluster ML deployments.

Installing ArgoCD

ArgoCD runs as a set of controllers inside your Kubernetes cluster. Install it with a single manifest or Helm chart:

Bash
# Create namespace and install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

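# Expose the ArgoCD API server and UI at https://localhost:8080
# (port-forward blocks, so run it in a separate terminal)
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Get the initial admin password (requires the argocd CLI)
argocd admin initial-password -n argocd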

Defining an ML Application

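An ArgoCD Application resource connects a path in a Git repository to a destination cluster and namespace. For ML workloads, define separate applications for training infrastructure, serving infrastructure, and monitoring, so each can be synced and rolled back independently. The example below deploys the serving manifests and assumes an AppProject named ml-platform already exists: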

YAML
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-model-serving
  namespace: argocd
spec:
  project: ml-platform
  source:
    repoURL: https://github.com/org/ml-infra.git
    targetRevision: main
    path: environments/production/serving
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
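
The Application above references a project named ml-platform, which ArgoCD expects to exist as an AppProject. A minimal sketch of such a project might look like the following. The allowed repository and namespace are assumptions taken from the examples in this lesson, not a prescribed layout:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ml-platform
  namespace: argocd
spec:
  description: ML platform workloads (illustrative)
  # Only allow manifests from the repository used in this lesson
  sourceRepos:
    - https://github.com/org/ml-infra.git
  # Restrict deployments to the in-cluster API server and the assumed namespace
  destinations:
    - server: https://kubernetes.default.svc
      namespace: ml-serving
  # Permit cluster-scoped Namespace objects in case the manifests define them
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
```

Scoping the project this way means an Application in ml-platform cannot deploy from an unexpected repository or into an unrelated namespace.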

ApplicationSets for Multi-Environment ML

ApplicationSets allow you to template applications across multiple environments (dev, staging, production) or multiple ML models from a single definition:

YAML
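apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ml-models
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          # Each key becomes a template parameter, e.g. {{model}} or {{gpu}}
          - model: recommendation-v2
            env: production
            gpu: "4"
          - model: nlp-classifier
            env: production
            gpu: "2"
  template:
    metadata:
      name: '{{model}}-{{env}}'
    spec:
      # project, repoURL, targetRevision, and destination are assumed
      # to match the Application example above
      project: ml-platform
      source:
        repoURL: https://github.com/org/ml-infra.git
        targetRevision: main
        path: 'models/{{model}}/{{env}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: ml-serving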
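
For multi-cluster deployments, the same templating idea can be driven by the cluster generator, which produces one Application per cluster registered with ArgoCD. The following is a sketch only: the workload label on the cluster Secret and the ApplicationSet name are hypothetical, and the repository and path are reused from the Application example earlier in this lesson.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ml-serving-clusters        # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            workload: ml-serving   # hypothetical label on the cluster Secret
  template:
    metadata:
      # {{name}} and {{server}} are parameters provided by the cluster generator
      name: 'ml-serving-{{name}}'
    spec:
      project: ml-platform
      source:
        repoURL: https://github.com/org/ml-infra.git
        targetRevision: main
        path: environments/production/serving
      destination:
        server: '{{server}}'
        namespace: ml-serving
```

Adding or removing a labeled cluster then adds or removes the corresponding Application without editing the ApplicationSet.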

Custom Health Checks for ML Resources

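ArgoCD ships with built-in health checks for core resources such as Deployments and Jobs, but it needs custom health checks, written in Lua, to understand the lifecycle of ML-specific custom resources like training jobs and inference services: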

  • Training Jobs — Healthy when the job completes successfully, degraded when pods are in CrashLoopBackOff
  • Model Servers — Healthy when readiness probes pass and the model is loaded into memory
  • Feature Pipelines — Healthy when the latest pipeline run completed within the expected SLA
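
As an illustration of the training-job case, custom health checks are Lua scripts registered in the argocd-cm ConfigMap under a resource.customizations.health.<group>_<kind> key. The sketch below assumes the training job is a Kubeflow PyTorchJob that reports Succeeded and Failed conditions in its status. That resource type is an assumption for this example, so adapt the key and condition names to whatever CRD you actually run.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  resource.customizations.health.kubeflow.org_PyTorchJob: |
    hs = {}
    -- Default: the job has been created but has not finished yet
    hs.status = "Progressing"
    hs.message = "Waiting for training job to finish"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Succeeded" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
          return hs
        end
        if condition.type == "Failed" and condition.status == "True" then
          hs.status = "Degraded"
          hs.message = condition.message
          return hs
        end
      end
    end
    return hs
```

In an existing installation, merge this key into the argocd-cm ConfigMap you already have rather than replacing the whole object, since that ConfigMap also holds other ArgoCD settings.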

Sync Waves for ML Deployments

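ML deployments often have ordering dependencies. ArgoCD sync waves, set with the argocd.argoproj.io/sync-wave annotation, control that order. Lower-numbered waves are applied first, and ArgoCD waits for their resources to become healthy before starting the next wave: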

  • Wave 0: Namespaces, ConfigMaps, Secrets
  • Wave 1: PersistentVolumeClaims for model artifacts and datasets
  • Wave 2: Feature store and data pipeline deployments
  • Wave 3: Model serving deployments (depends on feature store readiness)
  • Wave 4: Monitoring and alerting resources
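
The ordering above is expressed per resource with an annotation. As a minimal sketch, the manifests below place a ConfigMap in wave 0 and a model-serving Deployment in wave 3. The resource names, image, and configuration key are placeholders for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-server-config        # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "0"
data:
  MODEL_NAME: recommendation-v2
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server               # hypothetical name
  annotations:
    # Applied only after waves 0 through 2 have synced and are healthy
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:1.0.0   # placeholder image
          envFrom:
            - configMapRef:
                name: model-server-config
```

The annotation value must be a quoted string, and resources without the annotation default to wave 0.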

Production Tip: Use ArgoCD notifications to send Slack or email alerts when model deployments sync, fail, or drift. This keeps the ML team aware of infrastructure changes without requiring them to watch the ArgoCD dashboard.
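
A minimal sketch of the Slack case might look like the following. It assumes a Slack token is stored under the slack-token key in the argocd-notifications-secret Secret. The trigger name, template name, and ml-platform-alerts channel are illustrative choices.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # $slack-token is resolved from argocd-notifications-secret
  service.slack: |
    token: $slack-token
  # Fire when a sync operation ends in Error or Failed
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
  template.app-sync-failed: |
    message: |
      Sync failed for {{.app.metadata.name}}: {{.app.status.operationState.message}}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-model-serving
  namespace: argocd
  annotations:
    # Format: notifications.argoproj.io/subscribe.<trigger>.<service>: <channel>
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ml-platform-alerts
# spec omitted here; it is unchanged from the Application example earlier in this lesson
```

Subscribing through an annotation keeps the alert routing in Git alongside the Application, so each team can choose its own channel.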

Ready to Learn Flux?

The next lesson covers Flux CD as an alternative GitOps controller, with its composable architecture and image automation features.
