Drift Detection (Advanced)
Drift occurs when the actual state of your ML infrastructure diverges from the desired state defined in Git. In ML environments, drift can be caused by manual kubectl changes, auto-scaling events, operator-managed mutations, or even hardware failures. This lesson covers how GitOps controllers detect drift and how to build automated remediation workflows.
Types of Drift in ML Infrastructure
- Configuration drift — Resource specs, environment variables, or ConfigMaps are changed outside Git (e.g., someone manually increases GPU count)
- Image drift — Container images are updated directly in the cluster without updating Git manifests
- Resource drift — Resources are created or deleted in the cluster without corresponding Git changes
- Secret drift — Credentials are rotated in the cluster but not updated in the sealed secret repository
ArgoCD Drift Detection
ArgoCD continuously compares the live state of Kubernetes resources against the desired state in Git. When drift is detected, the application status changes to "OutOfSync":
# Check sync status of all ML applications
argocd app list --project ml-platform

# View drift details for a specific application
argocd app diff ml-model-serving

# Force sync to remediate drift
argocd app sync ml-model-serving --prune
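The manual diff check can also be scripted. The following is a minimal sketch, assuming the argocd CLI is already logged in and that `argocd app diff` follows its documented exit codes (0 when no diff is found, 1 when a diff is found, 2 on general errors). The function name `check_drift` is hypothetical and not part of ArgoCD.

```shell
#!/usr/bin/env bash
# Sketch: classify an application's drift state from the exit code of
# `argocd app diff`. check_drift is an illustrative helper name.

check_drift() {
  local app="$1"
  local rc=0
  # Discard the diff body; only the exit code is inspected here.
  argocd app diff "$app" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "in-sync" ;;
    1) echo "drifted" ;;
    *) echo "error" ;;
  esac
}
```

Calling `check_drift ml-model-serving` prints one of `in-sync`, `drifted`, or `error`, which a cron job or CI step could use to decide whether to raise an alert.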
Flux Drift Detection
Flux detects drift during its reconciliation loop. The Kustomization controller compares the last applied configuration with the current cluster state:
# Check Flux reconciliation status
flux get kustomizations

# Force reconciliation
flux reconcile kustomization ml-serving --with-source

# View events for drift detection
kubectl get events -n flux-system --field-selector reason=ReconciliationSucceeded
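For scripting, the reconciliation status can be read from the Kustomization's Ready condition. This is a rough sketch, assuming the Kustomization lives in the flux-system namespace by default; the helper name `flux_ready` is hypothetical. Note that it reports reconciliation health, which is only a proxy for drift: a Ready Kustomization means the last reconciliation succeeded.

```shell
#!/usr/bin/env bash
# Sketch: report whether a Flux Kustomization's last reconciliation succeeded.
# flux_ready is an illustrative helper name; the default namespace is an assumption.

flux_ready() {
  local name="$1"
  local ns="${2:-flux-system}"
  local status=""
  # Read the status of the "Ready" condition; treat lookup failures as unknown.
  status="$(kubectl get kustomizations.kustomize.toolkit.fluxcd.io "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)" || status=""
  case "$status" in
    True)  echo "reconciled" ;;
    False) echo "failing" ;;
    *)     echo "unknown" ;;
  esac
}
```

For example, `flux_ready ml-serving` prints `reconciled`, `failing`, or `unknown`.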
Automated Remediation Strategies
- Self-healing (auto-sync): Configure ArgoCD or Flux to automatically revert drift. Best for production environments where the Git state is always authoritative.
- Alert-and-review: Detect drift and alert the team, but require manual approval before remediation. Best for sensitive ML deployments where drift might be intentional (e.g., emergency scaling).
- Drift-to-PR: Detect drift and automatically create a pull request that either reverts the drift or codifies it. Best for teams that want to review all changes.
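The three strategies above could be wired into a small dispatcher. This is a sketch under stated assumptions, not a complete workflow: `apply_strategy` and the strategy names are hypothetical, only the ArgoCD side is shown, and the pull-request step is left as a placeholder because the tooling for it is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: map a remediation strategy name to an ArgoCD sync policy.
# apply_strategy and the strategy labels are illustrative names.

apply_strategy() {
  local app="$1"
  local strategy="$2"
  case "$strategy" in
    self-heal)
      # Git is authoritative: let ArgoCD revert drift automatically.
      argocd app set "$app" --sync-policy automated --self-heal
      ;;
    alert-and-review)
      # Disable automated sync so a human decides when to remediate.
      argocd app set "$app" --sync-policy none
      echo "ALERT: review drift on $app before syncing"
      ;;
    drift-to-pr)
      # Placeholder: opening the pull request is not shown here.
      echo "TODO: open a pull request for drift on $app"
      ;;
    *)
      echo "unknown strategy: $strategy" >&2
      return 1
      ;;
  esac
}
```

For example, `apply_strategy ml-model-serving self-heal` enables automated sync with self-healing for that application.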
Handling Legitimate Drift
Not all drift is bad. Some ML infrastructure changes are intentional but made outside the GitOps workflow:
- HPA scaling: The Horizontal Pod Autoscaler changes replica counts; use ignoreDifferences in ArgoCD to exclude these fields
- Operator-managed fields: Operators like the GPU operator may add annotations or labels; exclude operator-managed fields from drift detection
- Emergency changes: During incidents, engineers may need to make direct changes; establish a process to back-port emergency changes to Git
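As an illustration of the HPA case, the sketch below patches an ArgoCD Application so that Deployment replica counts are ignored when computing drift. It assumes the Application resource lives in the argocd namespace and that the HPA targets Deployments; the function names are hypothetical. In a GitOps setup the same ignoreDifferences entry would normally be committed to the Application manifest in Git rather than patched live.

```shell
#!/usr/bin/env bash
# Sketch: ignore HPA-managed replica counts in ArgoCD drift detection.
# Function names and the default namespace are illustrative assumptions.

ignore_replicas_patch() {
  # JSON merge patch setting spec.ignoreDifferences for apps/Deployment.
  # A merge patch replaces the whole list, so existing entries are overwritten.
  cat <<'EOF'
{"spec":{"ignoreDifferences":[{"group":"apps","kind":"Deployment","jsonPointers":["/spec/replicas"]}]}}
EOF
}

ignore_hpa_replicas() {
  local app="$1"
  local ns="${2:-argocd}"
  kubectl patch applications.argoproj.io "$app" -n "$ns" \
    --type merge -p "$(ignore_replicas_patch)"
}
```

For example, `ignore_hpa_replicas ml-model-serving` applies the patch to that Application.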
Lilly Tech Systems