Introduction to GitOps for ML (Beginner)
GitOps is an operational framework that takes DevOps best practices from application development (version control, collaboration, compliance, and CI/CD) and applies them to infrastructure automation. For ML infrastructure, GitOps provides a declarative, auditable, and reproducible approach to managing the complex lifecycle of training, serving, and monitoring ML systems.
What Is GitOps?
The term GitOps was coined by Weaveworks in 2017. At its core, GitOps uses Git repositories as the single source of truth for declarative infrastructure and applications. The desired state of the system is described in Git, and automated controllers continuously reconcile the actual state to match the desired state.
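The reconciliation idea can be sketched in a few lines of Python. This is a toy model, not how any particular controller is implemented: the resource names (`model-server`, `training-job`, `old-pipeline`) and the `plan` and `reconcile` helpers are hypothetical, and plain dictionaries stand in for both the Git repository and the cluster.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to make `actual` match `desired`."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan to a copy of `actual` and return the resulting state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


# Desired state, standing in for what is committed to Git (hypothetical).
desired = {
    "model-server": {"replicas": 2},
    "training-job": {"gpus": 4},
}
# Actual state, standing in for what is running in the cluster (hypothetical).
actual = {
    "model-server": {"replicas": 1},        # drifted from Git
    "old-pipeline": {"schedule": "daily"},  # no longer declared in Git
}

steps = plan(desired, actual)
converged = reconcile(desired, actual)
```

Running `plan` again on the converged state yields an empty list, which is the property a reconciling controller relies on: once the actual state matches Git, there is nothing left to do.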
The Four Principles of GitOps
- Declarative Configuration
The entire system, including infrastructure and applications, is described declaratively. For ML, this means training jobs, model servers, and pipelines are all defined as code.
- Version Controlled
The desired state is stored in Git, providing a complete audit trail. Every change to ML infrastructure is tracked with who, what, when, and why.
- Automatically Applied
Approved changes are automatically applied to the system. When a pull request merges, the ML infrastructure updates itself without manual intervention.
- Continuously Reconciled
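Software agents (like ArgoCD or Flux) continuously compare the actual state with the desired state. If someone manually changes a GPU allocation, the controller detects the drift and, depending on how its sync behavior is configured, reverts the change to match Git.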
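As a rough illustration of the first and last principles together, the sketch below describes a GPU-backed model server as plain data and then simulates a manual change to its GPU limit. Several things here are assumptions: the Deployment name, the image URL, and the helper functions are hypothetical, `nvidia.com/gpu` is the resource name used when the NVIDIA device plugin is installed, and the manifest is written as JSON only to keep the example within the Python standard library (teams more commonly commit YAML).

```python
import copy
import json


def model_server_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Kubernetes-style Deployment manifest as plain data."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": "server",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ]
                },
            },
        },
    }


def gpu_limit(manifest: dict) -> int:
    """Read the GPU limit of the first container in the manifest."""
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    return container["resources"]["limits"]["nvidia.com/gpu"]


# Desired state as it would be committed to Git; sorted keys keep diffs stable.
desired = model_server_manifest(
    "fraud-model", "registry.example.com/fraud-model:1.0", gpus=2
)
committed_text = json.dumps(desired, indent=2, sort_keys=True)

# Simulate someone editing the live object by hand.
live = copy.deepcopy(desired)
live["spec"]["template"]["spec"]["containers"][0]["resources"]["limits"][
    "nvidia.com/gpu"
] = 8

# A reconciling controller would notice the mismatch and restore the
# committed value; here that is modeled as reloading the committed text.
drifted = live != json.loads(committed_text)
if drifted:
    live = json.loads(committed_text)
```

The point of the design is that the committed text, not the live object, decides the GPU count. Changing the allocation for real means changing `gpus` in the repository through a reviewed commit.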
Why GitOps for ML Infrastructure?
ML infrastructure has unique challenges that make GitOps especially valuable:
- Reproducibility — ML experiments must be reproducible; GitOps ensures the infrastructure state is always known and version-controlled
- Complex dependencies — ML systems have interdependent components (feature stores, training clusters, serving endpoints) that must be coordinated
- GPU resource management — Expensive GPU resources need careful allocation; GitOps provides auditable resource changes
- Model versioning — Model deployments can be tracked alongside infrastructure changes in a unified Git history
- Compliance — Regulated industries require audit trails for every infrastructure change affecting ML models
GitOps vs Traditional CI/CD for ML
| Aspect | Traditional CI/CD | GitOps |
|---|---|---|
| Deployment model | Push-based (CI pipeline pushes to cluster) | Pull-based (controller pulls from Git) |
| Source of truth | CI pipeline state / scripts | Git repository |
| Drift detection | Manual or none | Automatic and continuous |
| Rollback | Re-run previous pipeline | Git revert (applied on the controller's next sync) |
| Audit trail | CI logs (may expire) | Git history (kept for the life of the repository) |
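The pull-based model and revert-style rollback from the table can be sketched as a small simulation. Everything in it is hypothetical (the `Repo`, `Commit`, and `Controller` classes, the `fraud-model` resource, the commit messages); a list of commits stands in for Git, and the revert restores the whole previous state, which matches `git revert` only in the simple case of undoing the most recent commit.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Commit:
    message: str
    state: Dict[str, str]  # resource name -> deployed version


@dataclass
class Repo:
    history: List[Commit] = field(default_factory=list)

    def commit(self, message: str, state: Dict[str, str]) -> None:
        self.history.append(Commit(message, dict(state)))

    def revert_last(self) -> None:
        """Add a new commit that restores the state before the last commit."""
        if len(self.history) < 2:
            raise ValueError("nothing to revert to")
        bad, previous = self.history[-1], self.history[-2]
        self.commit(f'Revert "{bad.message}"', previous.state)

    def head(self) -> Commit:
        return self.history[-1]


@dataclass
class Controller:
    """Pull-based agent: it reads the repo; nothing pushes into the cluster."""

    cluster: Dict[str, str] = field(default_factory=dict)
    synced_commits: int = 0

    def poll(self, repo: Repo) -> bool:
        """Sync to HEAD if new commits exist; return True when a sync happened."""
        if self.synced_commits == len(repo.history):
            return False
        self.cluster = dict(repo.head().state)
        self.synced_commits = len(repo.history)
        return True


repo = Repo()
controller = Controller()

repo.commit("Deploy fraud-model v1", {"fraud-model": "v1"})
controller.poll(repo)

repo.commit("Deploy fraud-model v2", {"fraud-model": "v2"})
controller.poll(repo)

repo.revert_last()     # the rollback is just another commit
controller.poll(repo)  # it takes effect on the controller's next poll

messages = [c.message for c in repo.history]
```

After the final poll the cluster is back on `v1`, and the history holds three commits, including the revert. The rollback itself becomes part of the audit trail instead of replacing it.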
GitOps Tools Landscape
The two dominant GitOps controllers for Kubernetes are:
- ArgoCD — Full-featured GitOps controller with a rich web UI, application sets for multi-cluster management, and strong RBAC
- Flux — Lightweight, composable GitOps toolkit that integrates with Kustomize and Helm, with image automation controllers
Ready to Set Up ArgoCD?
The next lesson walks through installing and configuring ArgoCD for ML workload management on Kubernetes.
Lilly Tech Systems