GitOps explained
Traditional deployment pipelines push changes to production: a CI job
runs kubectl apply or calls a cloud API with credentials that
can modify anything. GitOps flips the direction. You declare
the desired state in Git — Kubernetes manifests, Helm charts, Kustomize
overlays, Terraform modules — and a controller running inside the
cluster continuously pulls that state and reconciles reality to match. Git
becomes the single source of truth; every change is a pull request with
review, audit trail, and instant rollback via git revert.
This guide covers the core principles, how pull-based reconciliation works,
popular tools (Argo CD and Flux), repository layout, secrets patterns,
drift detection, environment promotion, and when GitOps beats push-based
CI/CD, with links to Kubernetes fundamentals and infrastructure as code for the surrounding stack.
The four GitOps principles
The OpenGitOps project distills GitOps into four principles that every implementation should satisfy:
- Declarative. The entire system is described declaratively — YAML manifests, Helm values, Terraform HCL. You state what you want, not imperative shell scripts that mutate state step by step.
- Versioned and immutable. That declarative description lives in Git (or another version-control system). Every change has an author, timestamp, diff, and optional code review. Tags and branches represent releases.
- Pulled automatically. Software agents (Argo CD, Flux, Terraform Cloud) pull the declared state and apply it without manual kubectl sessions. Humans approve PRs; machines deploy.
- Continuously reconciled. Agents do not just apply once; they watch for drift. If someone hand-edits a Deployment replica count in the live cluster, the controller detects the mismatch and either reverts to Git or raises an alert.
Together, these principles turn operations into a software engineering workflow: the same branching, review, and rollback habits that keep application code safe now protect production infrastructure.
Push vs pull: why direction matters
In a push-based pipeline, your CI server holds cluster
admin credentials. A green build triggers helm upgrade or
terraform apply from outside the cluster. This works for
small teams but creates problems at scale:
- Credential sprawl. Every CI job, script, and engineer who can trigger a deploy needs powerful API keys. Leaked Jenkins credentials have caused real breaches.
- No continuous drift detection. Push pipelines run on events (merge to main). Between deploys, manual kubectl edit changes go unnoticed until the next push overwrites them, or they are never corrected at all.
- Audit gaps. "Who changed replica count to 50?" might be answered by CI logs, cluster audit logs, or a Slack message: three different sources.
In a pull-based model, the in-cluster GitOps controller is the only automation with write access to the API server. CI builds and pushes a container image, updates a manifest in Git (an image tag bump), and stops. Argo CD or Flux notices the commit, pulls the new manifest, and applies it. The blast radius of a compromised CI pipeline shrinks dramatically: the pipeline holds no cluster credentials, so an attacker can only change what Git says should run, and only by getting a commit onto the protected branch.
The reconciliation loop
Whether you use Argo CD, Flux, or a homegrown operator, the core loop is identical:
- Observe — read the desired state from Git (a branch, tag, or semver range).
- Compare — diff desired state against live cluster resources (Deployments, Services, ConfigMaps, CRDs).
- Act — create, update, or prune resources to converge.
- Report — surface sync status, health, and diffs in a UI or metrics endpoint.
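The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Argo CD or Flux is implemented: it assumes desired and live state are plain dictionaries keyed by a resource name, and the names `plan` and `reconcile` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model: resource name -> spec. A real controller
# would read manifests from Git and objects from the Kubernetes API.
State = Dict[str, dict]


def plan(desired: State, live: State, prune: bool = False) -> List[Tuple[str, str]]:
    """Compare desired and live state and return (action, name) pairs."""
    actions = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != desired[name]:
            actions.append(("update", name))
    if prune:
        for name in sorted(live):
            if name not in desired:
                actions.append(("prune", name))
    return actions


def reconcile(desired: State, live: State, prune: bool = False) -> List[Tuple[str, str]]:
    """Apply the plan to `live` in place and return it as the report."""
    actions = plan(desired, live, prune)
    for action, name in actions:
        if action == "prune":
            del live[name]
        else:
            live[name] = dict(desired[name])
    return actions


desired = {"deploy/payment-api": {"replicas": 3}, "svc/payment-api": {"port": 80}}
live = {"deploy/payment-api": {"replicas": 50}, "cm/old-config": {"key": "v"}}

print(reconcile(desired, live, prune=True))  # first pass converges
print(reconcile(desired, live, prune=True))  # second pass finds nothing to do
```

Running the loop a second time returns an empty plan, which is the property that lets a controller run it continuously without side effects.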
Sync policies control automation level:
- Manual sync: an operator clicks "Sync" in the Argo CD UI after reviewing the diff. Safest for early adoption.
- Auto-sync: the controller applies Git changes shortly after merge, on its next poll or almost immediately when a webhook notifies it. Standard for staging; use with care in production.
- Self-heal: when enabled, the controller reverts manual cluster edits back to Git. Powerful for preventing configuration drift; frustrating if on-call engineers cannot kubectl-patch during an incident without disabling self-heal first.
- Prune: delete cluster resources that no longer exist in Git. Essential for clean teardowns; dangerous if someone deletes a manifest by accident.
Pair auto-sync with branch protection and required reviewers. The controller is only as safe as the merge policy guarding the branch it watches.
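How the sync policies interact is easier to see as a decision function. The sketch below is a simplified model loosely based on the behaviour described above; `SyncPolicy`, `decide`, and the `cause` values are hypothetical names, not part of any controller's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPolicy:
    auto_sync: bool = False
    self_heal: bool = False
    prune: bool = False


def decide(policy: SyncPolicy, cause: str, resource_in_git: bool) -> str:
    """Return what a controller would do for one out-of-sync resource.

    cause is "git_change" (a new commit) or "live_edit" (a manual change
    in the cluster). The result is "apply", "delete", or "wait".
    """
    if not policy.auto_sync:
        return "wait"  # manual sync: a human reviews the diff first
    if cause == "live_edit" and not policy.self_heal:
        return "wait"  # drift is reported but not reverted
    if not resource_in_git:
        return "delete" if policy.prune else "wait"
    return "apply"


staging = SyncPolicy(auto_sync=True, self_heal=True, prune=True)
production = SyncPolicy(auto_sync=True)

print(decide(staging, "live_edit", True))      # apply
print(decide(production, "live_edit", True))   # wait
print(decide(production, "git_change", False)) # wait
```

In this model a hand-edited resource in production stays as it is until someone syncs or enables self-heal, which matches the incident-response trade-off noted above.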
Argo CD vs Flux: choosing a controller
Both are CNCF-graduated, Kubernetes-native, and production-proven. The choice is often organizational rather than technical:
Argo CD
- Rich web UI showing application topology, sync status, and live diffs — excellent for teams new to GitOps.
- Application CRD wraps Helm, Kustomize, plain YAML, Jsonnet, and plugins in one abstraction.
- Multi-cluster management: a central Argo CD instance can deploy to registered remote clusters.
- Part of the broader Argo suite (Workflows, Events, Rollouts) for progressive delivery.
Flux CD
- Modular controllers: source-controller fetches Git/OCI artifacts; kustomize-controller and helm-controller apply them. Compose only what you need.
- Native HelmRelease and Kustomization CRDs — no wrapper Application object.
- Strong OCI artifact support (push Helm charts to container registries).
- Lighter UI footprint; status via the flux get CLI and Prometheus metrics. GitOps Toolkit integrates cleanly with observability stacks.
Many enterprises run both: Argo CD for application teams who want a UI, Flux for platform teams managing cluster add-ons. Either beats shell scripts in CI.
Repository layout and environment promotion
How you structure Git repos determines how painful promotion from staging to production becomes. Three common patterns:
Monorepo with overlay directories
    apps/
      payment-api/
        base/              # shared Deployment + Service
        overlays/
          staging/         # 1 replica, staging image tag
          production/      # 3 replicas, prod image tag
Kustomize applies patches per environment. One Argo CD Application points at overlays/staging and another at overlays/production. Promotion is a PR that copies or retags the image reference in the production overlay.
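A promotion bot for this layout could be as small as the sketch below. It assumes each overlay's kustomization file pins the image with a Kustomize `newTag:` line and that there is one such line per file; the function names and the `sha-000000` tag are hypothetical. It edits text with a regular expression to stay dependency-free, so a tag followed by an inline comment would not match.

```python
import re

# Matches a Kustomize-style "newTag: <value>" line, keeping its indentation.
NEW_TAG = re.compile(r"^([ \t]*newTag:[ \t]*)(\S+)[ \t]*$", re.MULTILINE)


def read_tag(kustomization: str) -> str:
    """Return the first image tag pinned in a kustomization file."""
    match = NEW_TAG.search(kustomization)
    if match is None:
        raise ValueError("no newTag entry found")
    return match.group(2)


def promote(staging: str, production: str) -> str:
    """Return the production overlay text with staging's image tag copied in."""
    tag = read_tag(staging)
    read_tag(production)  # fail loudly if production has nothing to retag
    return NEW_TAG.sub(lambda m: m.group(1) + tag, production, count=1)


staging_overlay = "images:\n  - name: myapp\n    newTag: sha-abc123\n"
production_overlay = "images:\n  - name: myapp\n    newTag: sha-000000\n"

print(promote(staging_overlay, production_overlay))
```

The bot would commit the returned text on a branch and open a PR, leaving the merge decision to a reviewer.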
Repo-per-environment
Separate infra-staging and infra-production
repos. Promotion is a PR from staging repo to production repo (or a bot
that opens it). Strong blast-radius isolation; more overhead syncing
shared base configs.
Trunk-based with image tags only
Single main branch; environments differ only by which image
digest or semver tag the manifest references. CI on merge builds
myapp:sha-abc123; a separate "promote" PR bumps the tag in
the production overlay. Works well with
blue-green and canary
controllers that shift traffic between tagged revisions.
Whichever layout you choose, keep application source code and deployment manifests in related but separable repos when possible. Developers should not need cluster-admin to merge a feature branch.
Secrets in a GitOps world
Plaintext secrets must never land in Git — even in private repos. Common patterns:
- Sealed Secrets / SOPS. Encrypt secrets in Git with a cluster-side controller (Bitnami Sealed Secrets) or Mozilla SOPS with age/PGP keys. Git stores ciphertext; only the cluster decrypts at apply time. Rotation requires re-encrypting and committing.
- External secret stores. External Secrets Operator or Vault Agent Injector pull secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault at runtime. Git holds only references (secret name + key path). Best for compliance-heavy environments.
- Cloud IAM roles. Workloads use IRSA (AWS), Workload Identity (GCP), or Azure Workload Identity instead of static credentials in manifests at all.
See also secrets management for rotation schedules and least-privilege patterns that apply regardless of GitOps tooling.
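One way to enforce the no-plaintext rule is a check that runs in CI or a pre-commit hook. The sketch below is a heuristic under stated assumptions, not a complete scanner: it treats any YAML document with `kind: Secret` and a top-level `data` or `stringData` key as plaintext unless it carries the top-level `sops` metadata key that SOPS adds to encrypted files. The function name and sample manifest are hypothetical.

```python
import re
from typing import List

KIND_SECRET = re.compile(r"^kind:[ \t]*Secret[ \t]*$", re.MULTILINE)
PAYLOAD = re.compile(r"^(data|stringData):", re.MULTILINE)
SOPS_META = re.compile(r"^sops:", re.MULTILINE)


def plaintext_secret_docs(manifest: str) -> List[int]:
    """Return indexes of YAML documents that look like unencrypted Secrets."""
    flagged = []
    docs = re.split(r"^---[ \t]*$", manifest, flags=re.MULTILINE)
    for index, doc in enumerate(docs):
        is_secret = KIND_SECRET.search(doc) and PAYLOAD.search(doc)
        if is_secret and not SOPS_META.search(doc):
            flagged.append(index)
    return flagged


manifest = """\
apiVersion: v1
kind: Secret
metadata:
  name: db
stringData:
  password: hunter2
---
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db
spec:
  encryptedData:
    password: placeholder-ciphertext
"""

print(plaintext_secret_docs(manifest))  # [0]
```

Base64 values under `data` are flagged too, since base64 is an encoding rather than encryption.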
Drift, rollback, and disaster recovery
GitOps makes rollback trivial in theory: git revert the bad
commit, merge, and the controller syncs the previous state. In practice:
- Database migrations are not reversed by GitOps. Pair manifest rollbacks with backward-compatible migrations or run manual down-migrations as a separate step.
- StatefulSets and PVCs may retain data across rollbacks. A reverted Deployment might point at an old image that cannot read a schema migrated forward.
- Prune + revert can delete resources you still need. Use Prune=false on critical resources or finalizers during incident response.
For disaster recovery, Git is your backup of desired state. Rebuilding a cluster means installing the GitOps controller, pointing it at the repo, and syncing. Keep Terraform for cloud primitives (VPCs, node pools) in a separate layer — GitOps typically manages in-cluster resources, not the cluster itself.
Decision table: when GitOps fits
| Scenario | GitOps fit | Alternative |
|---|---|---|
| Kubernetes microservices, multiple envs | Excellent — built for this | Helm in CI push (works but no drift detection) |
| Single VPS, one Docker Compose file | Overkill | Ansible, CapRover, or plain Compose + CI SSH |
| Serverless (Lambda, Cloud Functions) | Partial — use Terraform/SAM in Git, not Argo CD | Framework-native deploy (Serverless Framework) |
| Regulated industry needing audit trail | Excellent — Git history is the audit log | Change-management tickets + manual deploy (slower) |
| Frequent hot-patch during incidents | Friction unless you disable self-heal temporarily | Push deploy with runbook override, re-sync Git after |
| Multi-cluster fleet (10+ clusters) | Excellent with app-of-apps or Flux multi-tenancy | Custom config management (hard to maintain) |
Common mistakes
- Storing plaintext secrets in Git: use Sealed Secrets, SOPS, or external stores from day one, not after a leak.
- Auto-sync + auto-prune on day one: start manual, add automation after the team trusts the diff review process.
- Mixing imperative and declarative: a pipeline that both runs kubectl apply and lets Argo CD manage the same resources causes fighting controllers and mystery state.
- No health checks: syncing a broken manifest is fast; knowing the app actually works requires readiness probes and Argo CD health assessments (or smoke tests post-sync).
- Giant monolithic repo: one repo for 200 services means every PR triggers full-cluster diffs. Split by team or blast radius.
- Ignoring CRD install order: applying a custom resource (such as an Argo CD Application) before its CRD exists fails outright or retries in a loop. Use sync waves or separate bootstrap repos for platform CRDs.
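The effect of sync waves on install order can be illustrated with a short sort. This sketch assumes resources are already parsed into dictionaries and orders them only by the Argo CD `argocd.argoproj.io/sync-wave` annotation (lower waves first, default 0); the real controller considers more than the wave, and the `Widget` resource and function names are hypothetical.

```python
from typing import List

WAVE_KEY = "argocd.argoproj.io/sync-wave"


def wave(resource: dict) -> int:
    """Read the sync wave from annotations, defaulting to 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_KEY, "0"))


def apply_order(resources: List[dict]) -> List[str]:
    """Return kind/name labels in the order the waves would be applied."""
    ordered = sorted(resources, key=wave)  # stable: ties keep input order
    return [f'{r["kind"]}/{r["metadata"]["name"]}' for r in ordered]


resources = [
    {"kind": "Widget", "metadata": {"name": "demo"}},
    {
        "kind": "CustomResourceDefinition",
        "metadata": {
            "name": "widgets.example.com",
            "annotations": {WAVE_KEY: "-1"},
        },
    },
]

print(apply_order(resources))
```

Giving the CRD a negative wave moves it ahead of the custom resource that depends on it, regardless of the order in which the files appear in Git.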
Production checklist
- GitOps controller installed with least-privilege RBAC (not cluster-admin unless unavoidable).
- Protected branches on the manifest repo; required reviewers for production overlays.
- Secrets encrypted (Sealed Secrets/SOPS) or referenced from external vault — zero plaintext in Git history.
- Sync policy documented per environment: for example, auto-sync in staging and manual sync in production, with production auto-sync and self-heal enabled only after burn-in.
- Prune enabled with resource exclusions for StatefulSets, PVCs, and namespaces marked prune: false.
- Notifications wired (Slack, PagerDuty) on sync failure, health degradation, and out-of-sync drift.
- Rollback runbook tested: revert commit, verify controller syncs, confirm application health — including DB migration caveats.
- CI pipeline updates image tags in Git (or opens promotion PR) but does not hold cluster credentials.
- Observability: controller metrics scraped, sync duration and failure rate on a dashboard.
- Bootstrap documented: how to rebuild the cluster from Git + Terraform in under one hour.
Related reading
- Kubernetes fundamentals explained — pods, Deployments, Services, and the objects GitOps manages
- CI/CD pipelines explained — build-test-deploy automation that feeds GitOps
- Infrastructure as code explained — provisioning clusters and cloud primitives beneath GitOps
- Blue-green and canary deployments explained — progressive delivery on top of Git-synced manifests