Drift Detection (Advanced)

Drift occurs when the actual state of your ML infrastructure diverges from the desired state defined in Git. In ML environments, drift can be caused by manual kubectl changes, auto-scaling events, operator-managed mutations, or even hardware failures. This lesson covers how GitOps controllers detect drift and how to build automated remediation workflows.

Types of Drift in ML Infrastructure

  • Configuration drift — Resource specs, environment variables, or ConfigMaps are changed outside Git (e.g., someone manually increases GPU count)
  • Image drift — Container images are updated directly in the cluster without updating Git manifests
  • Resource drift — Resources are created or deleted in the cluster without corresponding Git changes
  • Secret drift — Credentials are rotated in the cluster but not updated in the sealed secret repository

ArgoCD Drift Detection

ArgoCD continuously compares the live state of Kubernetes resources against the desired state rendered from Git. When the two differ, the application's sync status changes to "OutOfSync":

Bash
# Check sync status of all ML applications
argocd app list --project ml-platform

# View drift details for a specific application
argocd app diff ml-model-serving

# Sync to remediate drift; --prune also deletes resources that are no longer defined in Git
argocd app sync ml-model-serving --prune
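
For scripted drift checks, the exit code of argocd app diff can be used instead of parsing its output. The sketch below assumes the exit codes documented for recent Argo CD releases (0 when no diff is found, 1 when a diff is found, 2 on general errors); verify them against your version. The function names classify_diff_exit and check_drift are hypothetical helpers, not part of the argocd CLI.

```shell
#!/usr/bin/env bash
# Sketch: turn the exit code of `argocd app diff` into a drift status.
# Assumed exit codes: 0 = no diff, 1 = diff found, anything else = error.

# Hypothetical helper: map an exit code to a status string.
classify_diff_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    1) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical helper: run the diff quietly and report the status.
# Needs a logged-in argocd CLI; a missing CLI is reported as "error".
check_drift() {
  rc=0
  argocd app diff "$1" >/dev/null 2>&1 || rc=$?
  echo "$1: $(classify_diff_exit "$rc")"
}

# Example (requires cluster access):
#   check_drift ml-model-serving
```

A scheduled job could call check_drift for each ML application and raise an alert only when the status is "drift", which keeps CLI or connectivity errors separate from real drift.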

Flux Drift Detection

Flux detects drift during its reconciliation loop. On each interval, the Kustomization controller compares the manifests built from the Git source with the current cluster state and reapplies any objects that differ:

Bash
# Check Flux reconciliation status
flux get kustomizations

# Force reconciliation
flux reconcile kustomization ml-serving --with-source

# View events for drift detection
kubectl get events -n flux-system --field-selector reason=ReconciliationSucceeded

Automated Remediation Strategies

  1. Self-healing (auto-sync)

    Configure ArgoCD or Flux to automatically revert drift. Best for production environments where the Git state is always authoritative.

  2. Alert-and-review

    Detect drift and alert the team, but require manual approval before remediation. Best for sensitive ML deployments where drift might be intentional (e.g., emergency scaling).

  3. Drift-to-PR

    Detect drift and automatically create a pull request that either reverts the drift or codifies it. Best for teams that want to review all changes.
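
As a rough illustration of the first two strategies with ArgoCD, the sketch below toggles an application between self-healing and manual sync using argocd app set. The run wrapper and both function names are hypothetical; by default the commands are only printed, and setting APPLY=1 executes them against a logged-in argocd CLI.

```shell
#!/usr/bin/env bash
# Sketch: switch an Argo CD application between two remediation strategies.
# Commands are printed by default; set APPLY=1 to execute them.

# Hypothetical dry-run wrapper, not part of any CLI.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Strategy 1: self-healing. Argo CD syncs automatically, reverts live
# changes, and prunes resources that were removed from Git.
enable_self_heal() {
  run argocd app set "$1" --sync-policy automated --self-heal --auto-prune
}

# Strategy 2: alert-and-review. Automated sync is turned off, so drift is
# only reported as OutOfSync until someone runs `argocd app sync` manually.
disable_auto_sync() {
  run argocd app set "$1" --sync-policy none
}

# Example:
#   enable_self_heal ml-model-serving
#   disable_auto_sync ml-training
```

In a GitOps setup the same settings would normally be committed in the Application manifest (spec.syncPolicy) rather than set imperatively; the CLI form is shown only because it is short.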

Handling Legitimate Drift

Not all drift is bad. Some ML infrastructure changes are intentional but made outside the GitOps workflow:

  • HPA scaling — The Horizontal Pod Autoscaler changes replica counts; use ignoreDifferences in ArgoCD to exclude these fields
  • Operator-managed fields — Operators like the GPU operator may add annotations or labels; exclude operator-managed fields from drift detection
  • Emergency changes — During incidents, engineers may need to make direct changes; establish a process to back-port emergency changes to Git
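
The HPA case above can be sketched as a patch that adds an ignoreDifferences entry for Deployment replica counts. The helper name ignore_replicas_patch is hypothetical, and the commented kubectl command assumes that Application resources live in the argocd namespace.

```shell
#!/usr/bin/env bash
# Sketch: tell Argo CD to ignore HPA-managed replica counts on Deployments.

# Hypothetical helper: print a JSON merge patch for the Application's
# spec.ignoreDifferences field.
ignore_replicas_patch() {
  cat <<'EOF'
{"spec":{"ignoreDifferences":[{"group":"apps","kind":"Deployment","jsonPointers":["/spec/replicas"]}]}}
EOF
}

# Apply it (assumes the Application is in the "argocd" namespace).
# Note: a merge patch replaces the whole ignoreDifferences list.
#   kubectl patch application ml-model-serving -n argocd \
#     --type merge -p "$(ignore_replicas_patch)"
```

If the Application itself is managed from Git, the same entry belongs in its manifest; patching it live would introduce drift of its own.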

Important: Configure self-healing carefully for ML training jobs. Reverting a running training job over minor drift can waste hours of GPU compute time. Use sync windows and resource exclusions to protect long-running workloads.
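
One way to protect training runs is a deny sync window on the ArgoCD project. The sketch below wraps argocd proj windows add; the function name add_training_freeze, the schedule, the duration, and the application pattern ml-training-* are illustrative assumptions. The ARGOCD variable is a hypothetical override that allows a dry run with echo.

```shell
#!/usr/bin/env bash
# Sketch: add a deny sync window so automated syncs (including self-heal)
# are blocked for matching applications during the window.

# Hypothetical override: set ARGOCD=echo to print the arguments instead.
ARGOCD="${ARGOCD:-argocd}"

# Arguments: project, cron schedule, duration, application name pattern.
add_training_freeze() {
  "$ARGOCD" proj windows add "$1" \
    --kind deny \
    --schedule "$2" \
    --duration "$3" \
    --applications "$4"
}

# Example: block syncs for training apps for 12 hours from 20:00 each day.
#   add_training_freeze ml-platform "0 20 * * *" 12h "ml-training-*"
```

Manual syncs are also blocked by a deny window unless the window is created with --manual-sync, so emergency fixes may need that flag or a temporary window change.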

Ready for Best Practices?

The final lesson consolidates everything into production-ready GitOps patterns for ML infrastructure management.