
GitOps explained

Traditional deployment pipelines push changes to production: a CI job runs kubectl apply or calls a cloud API with credentials that can modify anything. GitOps flips the direction. You declare the desired state in Git — Kubernetes manifests, Helm charts, Kustomize overlays, Terraform modules — and a controller running inside the cluster continuously pulls that state and reconciles reality to match. Git becomes the single source of truth; every change is a pull request with review, audit trail, and instant rollback via git revert. This guide covers the core principles, how pull-based reconciliation works, popular tools (Argo CD and Flux), repository layout, secrets patterns, drift detection, environment promotion, and when GitOps beats push-based CI/CD — with links to Kubernetes fundamentals and infrastructure as code for the surrounding stack.

The four GitOps principles

The OpenGitOps project distills GitOps into four principles that every implementation should satisfy:

Declarative. The entire system is described declaratively — YAML manifests, Helm values, Terraform HCL. You state what you want, not imperative shell scripts that mutate state step by step.
Versioned and immutable. That declarative description lives in Git (or another version-control system). Every change has an author, timestamp, diff, and optional code review. Tags and branches represent releases.
Pulled automatically. Software agents (Argo CD, Flux, Terraform Cloud) pull the declared state from the source and apply it without manual kubectl sessions. Humans approve PRs; machines deploy.
Continuously reconciled. Agents do not just apply once — they watch for drift. If someone hand-edits a Deployment replica count in the live cluster, the controller detects the mismatch and either reverts to Git or raises an alert.

Together, these principles turn operations into a software engineering workflow: the same branching, review, and rollback habits that keep application code safe now protect production infrastructure.

Push vs pull: why direction matters

In a push-based pipeline, your CI server holds cluster admin credentials. A green build triggers helm upgrade or terraform apply from outside the cluster. This works for small teams but creates problems at scale:

Credential sprawl. Every CI job, script, and engineer who can trigger a deploy needs powerful API keys. Leaked Jenkins credentials have caused real breaches.
No continuous drift detection. Push pipelines run on events (merge to main). Between deploys, manual kubectl edit changes go unnoticed until the next push overwrites them, or are never corrected at all.
Audit gaps. "Who changed replica count to 50?" might be answered by CI logs, cluster audit logs, or a Slack message — three different sources.

In a pull-based model, only the in-cluster GitOps controller needs write access to the API server. CI builds and pushes a container image, updates a manifest in Git (an image tag bump), and stops. Argo CD or Flux notices the commit, pulls the new manifest, and applies it. The blast radius of a compromised CI pipeline shrinks dramatically: an attacker can at most change what Git says should run, and only by getting a commit onto the protected branch. The pipeline itself holds no credentials for reaching production directly.
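The hand-off described above, where CI edits a manifest in Git and then stops, can be sketched in a few lines of Python. This is a minimal illustration, not how any particular pipeline does it: the function name bump_image_tag, the registry URL, and the tag format are assumptions made for the example, and real pipelines often use a YAML-aware tool (for instance kustomize edit set image) instead of a regular expression.

```python
import re


def bump_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest text with every `image: <image>:<tag>` line retagged.

    Only tag-style references are handled; digest references (@sha256:...)
    are left untouched by this sketch.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)" + re.escape(image) + r":\S+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash surprises in the new tag.
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{image}:{new_tag}", manifest
    )
    if count == 0:
        raise ValueError(f"no image reference for {image!r} found")
    return updated


# Hypothetical manifest fragment; the registry and names are illustrative.
MANIFEST = """\
spec:
  containers:
    - name: api
      image: registry.example.com/payment-api:sha-abc123
"""

NEW_MANIFEST = bump_image_tag(
    MANIFEST, "registry.example.com/payment-api", "sha-def456"
)
```

CI would commit NEW_MANIFEST to the manifest repo (or open a PR with it) and exit. Nothing in this step talks to the cluster, which is the point of the pull model.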

The reconciliation loop

Whether you use Argo CD, Flux, or a homegrown operator, the core loop is identical:

Observe — read the desired state from Git (a branch, tag, or semver range).
Compare — diff desired state against live cluster resources (Deployments, Services, ConfigMaps, CRDs).
Act — create, update, or prune resources to converge.
Report — surface sync status, health, and diffs in a UI or metrics endpoint.
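The four steps can be sketched as a single reconciliation pass over two dictionaries. This is a toy model under stated assumptions: desired and live state are plain dicts keyed by a resource name, and the names reconcile and SyncReport are invented for the example. Real controllers diff typed Kubernetes objects and handle ordering, health, and retries.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SyncReport:
    """What one reconciliation pass changed (the Report step)."""
    created: list = field(default_factory=list)
    updated: list = field(default_factory=list)
    pruned: list = field(default_factory=list)

    @property
    def in_sync(self) -> bool:
        return not (self.created or self.updated or self.pruned)


def reconcile(desired: dict, live: dict) -> SyncReport:
    """Run one pass; mutates `live` in place so it converges on `desired`."""
    report = SyncReport()
    # Compare and act: create what is missing, update what differs.
    for name, spec in desired.items():
        if name not in live:
            live[name] = copy.deepcopy(spec)
            report.created.append(name)
        elif live[name] != spec:
            live[name] = copy.deepcopy(spec)
            report.updated.append(name)
    # Prune: remove live resources that Git no longer declares.
    for name in [n for n in live if n not in desired]:
        del live[name]
        report.pruned.append(name)
    return report


# Observe: in a real controller `desired` is rendered from a Git revision.
desired = {"deploy/api": {"replicas": 3}, "svc/api": {"port": 80}}
live = {"deploy/api": {"replicas": 50}, "cm/old": {}}

first_pass = reconcile(desired, live)
second_pass = reconcile(desired, live)
```

The first pass corrects the hand-edited replica count, creates the missing Service, and prunes the stale ConfigMap; the second pass finds nothing to do, which is what "in sync" means.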

Sync policies control automation level:

Manual sync. An operator clicks "Sync" in the Argo CD UI after reviewing the diff. Safest for early adoption.
Auto-sync. The controller applies Git changes without a manual click, either at its next poll of the repository (every few minutes by default in Argo CD) or within seconds when a Git webhook is configured. Standard for staging; use with care in production.
Self-heal. When enabled, the controller reverts manual cluster edits back to Git. Powerful for preventing configuration drift; frustrating if on-call engineers cannot kubectl-patch during an incident without disabling self-heal first.
Prune. Delete cluster resources that no longer exist in Git. Essential for clean teardowns; dangerous if someone deletes a manifest by accident.

Pair auto-sync with branch protection and required reviewers. The controller is only as safe as the merge policy guarding the branch it watches.
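How the sync policies combine can be sketched as a small planning function. This is a simplified model loosely following Argo CD, where self-heal and prune are sub-options of automated sync; the names SyncPolicy and plan_actions and the three input sets are assumptions for illustration, not a controller API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPolicy:
    auto_sync: bool = False  # apply Git changes without a manual click
    self_heal: bool = False  # also revert manual edits made in the cluster
    prune: bool = False      # also delete resources that left Git


def plan_actions(policy: SyncPolicy, git_changed: set, drifted: set,
                 orphaned: set) -> dict:
    """Decide what the controller does on its own and what waits for a human.

    git_changed: resources whose definition changed in Git
    drifted:     resources edited in the cluster while Git stayed the same
    orphaned:    resources in the cluster that Git no longer declares
    """
    to_apply = set()
    if policy.auto_sync:
        to_apply |= git_changed
        if policy.self_heal:
            to_apply |= drifted
    to_delete = set(orphaned) if (policy.auto_sync and policy.prune) else set()
    pending = (git_changed | drifted | orphaned) - to_apply - to_delete
    return {"apply": to_apply, "delete": to_delete, "needs_manual_sync": pending}


# Example: auto-sync on, self-heal and prune still off.
cautious = plan_actions(
    SyncPolicy(auto_sync=True),
    git_changed={"deploy/api"},
    drifted={"deploy/worker"},
    orphaned={"cm/old"},
)
```

With only auto-sync enabled, the Git change is applied automatically while the drifted Deployment and the orphaned ConfigMap stay flagged as out of sync until someone acts on them.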

Argo CD vs Flux: choosing a controller

Both are CNCF-graduated, Kubernetes-native, and production-proven. The choice is often organizational rather than technical:

Argo CD

Rich web UI showing application topology, sync status, and live diffs — excellent for teams new to GitOps.
Application CRD wraps Helm, Kustomize, plain YAML, Jsonnet, and plugins in one abstraction.
Multi-cluster management: a central Argo CD instance can deploy to and monitor registered remote clusters.
Part of the broader Argo suite (Workflows, Events, Rollouts) for progressive delivery.

Flux CD

Modular controllers: source-controller fetches Git/OCI artifacts; kustomize-controller and helm-controller apply them. Compose only what you need.
Native HelmRelease and Kustomization CRDs — no wrapper Application object.
Strong OCI artifact support (push Helm charts to container registries).
No built-in web UI in the core project; status comes from the flux get CLI and Prometheus metrics. The GitOps Toolkit integrates cleanly with observability stacks.

Some organizations run both: Argo CD for application teams who want a UI, Flux for platform teams managing cluster add-ons. Either beats shell scripts in CI.

Repository layout and environment promotion

How you structure Git repos determines how painful promotion from staging to production becomes. Three common patterns:

Monorepo with overlay directories

apps/
  payment-api/
    base/           # shared Deployment + Service
    overlays/
      staging/      # 1 replica, staging image tag
      production/   # 3 replicas, prod image tag

Kustomize applies per-environment patches on top of the shared base. One Argo CD Application points at overlays/staging and another at overlays/production. Promotion is a PR that updates the image reference in the production overlay to the one already validated in staging.
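The base-plus-overlay idea can be illustrated with a recursive dictionary merge. This is only a conceptual sketch: Kustomize's strategic merge is considerably richer (it merges lists by key, for example, whereas this version replaces lists wholesale), and the function name apply_overlay and the sample values are assumptions chosen to mirror the directory tree above.

```python
import copy


def apply_overlay(base: dict, patch: dict) -> dict:
    """Return a new dict: `base` with `patch` merged in; patch values win."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested mappings so sibling keys survive the merge.
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared definition, as it might live in apps/payment-api/base/.
BASE = {
    "kind": "Deployment",
    "metadata": {"name": "payment-api", "labels": {"app": "payment-api"}},
    "spec": {"replicas": 1},
}

# Per-environment patches, as they might live under overlays/.
STAGING = apply_overlay(BASE, {"metadata": {"labels": {"env": "staging"}}})
PRODUCTION = apply_overlay(
    BASE,
    {"metadata": {"labels": {"env": "production"}}, "spec": {"replicas": 3}},
)
```

Each environment gets its own rendered manifest while the base stays untouched, so a fix to the base reaches every overlay on the next render.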

Repo-per-environment

Separate infra-staging and infra-production repos. Promotion is a PR against the production repo that carries over the change already validated in the staging repo (opened by a person or by a bot). Strong blast-radius isolation; more overhead syncing shared base configs.

Trunk-based with image tags only

Single main branch; environments differ only by which image digest or semver tag the manifest references. CI on merge builds myapp:sha-abc123; a separate "promote" PR bumps the tag in the production overlay. Works well with blue-green and canary controllers that shift traffic between tagged revisions.
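A promotion step of this kind might look like the following sketch. The env-to-image mapping, the function names is_pinned and promote, and the rule of refusing unpinned references are all assumptions for illustration; the source only says that a separate PR bumps the tag.

```python
def is_pinned(image_ref: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image_ref:
        return True
    _, sep, tag = image_ref.rpartition(":")
    # A '/' after the last ':' means that colon belonged to a registry port.
    return bool(sep) and "/" not in tag and tag not in ("", "latest")


def promote(images: dict, source: str = "staging",
            target: str = "production") -> dict:
    """Return a new env -> image mapping with `target` set to `source`'s image."""
    candidate = images[source]
    if not is_pinned(candidate):
        raise ValueError(f"refusing to promote unpinned image {candidate!r}")
    return {**images, target: candidate}


current = {"staging": "myapp:sha-abc123", "production": "myapp:sha-9f8e7d"}
proposed = promote(current)
```

The difference between current and proposed is exactly what the promotion PR would contain: a one-line change to the production image reference, reviewable like any other diff.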

Whichever layout you choose, keep application source code and deployment manifests in related but separable repos when possible. Developers should not need cluster-admin to merge a feature branch.

Secrets in a GitOps world

Plaintext secrets must never land in Git — even in private repos. Common patterns:

Sealed Secrets / SOPS. Encrypt secrets before they reach Git, using either Bitnami Sealed Secrets (a cluster-side controller holds the private key) or SOPS (originally a Mozilla project, now hosted by the CNCF) with age/PGP keys. Git stores only ciphertext; decryption happens in the cluster at apply time. Rotation requires re-encrypting and committing.
External secret stores. External Secrets Operator or Vault Agent Injector pull secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault at runtime. Git holds only references (secret name + key path). Best for compliance-heavy environments.
Cloud IAM roles. Workloads use IRSA (AWS), Workload Identity (GCP), or Azure Workload Identity, so manifests contain no static credentials at all.

See also secrets management for rotation schedules and least-privilege patterns that apply regardless of GitOps tooling.

Drift, rollback, and disaster recovery

GitOps makes rollback trivial in theory: git revert the bad commit, merge, and the controller syncs the previous state. In practice:

Database migrations are not reversed by GitOps. Pair manifest rollbacks with backward-compatible migrations or run manual down-migrations as a separate step.
StatefulSets and PVCs may retain data across rollbacks. A reverted Deployment might point at an old image that cannot read a schema migrated forward.
Prune + revert can delete resources you still need. Protect critical resources with a no-prune setting (in Argo CD, the Prune=false sync option) or with finalizers before relying on a revert during incident response.
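The prune caveat can be made concrete with a guard that separates protected resources from deletable ones. The annotation key and the Prune=false value follow Argo CD's sync-options convention; the function name split_prune_candidates, the dict-shaped resources, and the assumption that the annotation holds a comma-separated option list are illustrative simplifications.

```python
SYNC_OPTIONS_ANNOTATION = "argocd.argoproj.io/sync-options"


def split_prune_candidates(orphans: list) -> tuple:
    """Partition orphaned resources into (to_delete, protected)."""
    to_delete, protected = [], []
    for resource in orphans:
        annotations = resource.get("metadata", {}).get("annotations", {})
        raw = annotations.get(SYNC_OPTIONS_ANNOTATION, "")
        # Treat the value as a comma-separated list of Key=value options.
        options = [opt.strip() for opt in raw.split(",")]
        if "Prune=false" in options:
            protected.append(resource)
        else:
            to_delete.append(resource)
    return to_delete, protected


# Hypothetical resources left behind after a manifest was removed from Git.
orphans = [
    {"kind": "ConfigMap", "metadata": {"name": "old-flags"}},
    {
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": "orders-db",
            "annotations": {SYNC_OPTIONS_ANNOTATION: "Prune=false"},
        },
    },
]

to_delete, protected = split_prune_candidates(orphans)
```

Here the stale ConfigMap would be pruned while the annotated PVC survives, so an accidental manifest deletion or an over-eager revert does not take the data with it.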

For disaster recovery, Git is your backup of desired state. Rebuilding a cluster means installing the GitOps controller, pointing it at the repo, and syncing. Keep Terraform for cloud primitives (VPCs, node pools) in a separate layer — GitOps typically manages in-cluster resources, not the cluster itself.

Decision table: when GitOps fits

Scenario	GitOps fit	Alternative
Kubernetes microservices, multiple envs	Excellent — built for this	Helm in CI push (works but no drift detection)
Single VPS, one Docker Compose file	Overkill	Ansible, CapRover, or plain Compose + CI SSH
Serverless (Lambda, Cloud Functions)	Partial — use Terraform/SAM in Git, not Argo CD	Framework-native deploy (Serverless Framework)
Regulated industry needing audit trail	Excellent — Git history is the audit log	Change-management tickets + manual deploy (slower)
Frequent hot-patch during incidents	Friction unless you disable self-heal temporarily	Push deploy with runbook override, re-sync Git after
Multi-cluster fleet (10+ clusters)	Excellent with app-of-apps or Flux multi-tenancy	Custom config management (hard to maintain)

Common mistakes

Storing plaintext secrets in Git — use Sealed Secrets, SOPS, or external stores from day one, not after a leak.
Auto-sync + auto-prune on day one — start manual, add automation after the team trusts the diff review process.
Mixing imperative and declarative — a pipeline that both runs kubectl apply and lets Argo CD manage the same resources causes fighting controllers and mystery state.
No health checks — syncing a broken manifest is fast; knowing the app actually works requires readiness probes and Argo CD health assessments (or smoke tests post-sync).
Giant monolithic repo — one repo for 200 services means every PR triggers full-cluster diffs. Split by team or blast radius.
Ignoring CRD install order (applying a custom resource before its CRD is installed makes the sync fail or retry in a loop). Use sync waves or separate bootstrap repos for platform CRDs.
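The ordering fix mentioned in the last item can be sketched as a sort. The annotation key follows Argo CD's sync-wave convention (lower waves first, default 0); the tie-break that places CRDs first within a wave, the function names, and the sample resources are simplifying assumptions, and Argo CD's actual ordering also considers sync phases and resource kinds.

```python
SYNC_WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def sync_wave(resource: dict) -> int:
    """Read the wave from the annotation; resources without one are wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(SYNC_WAVE_ANNOTATION, "0"))


def apply_order(resources: list) -> list:
    """Sort by wave, then CRDs before everything else, then by name."""
    def key(resource: dict) -> tuple:
        is_crd = resource.get("kind") == "CustomResourceDefinition"
        return (sync_wave(resource), 0 if is_crd else 1,
                resource["metadata"]["name"])
    return sorted(resources, key=key)


# Illustrative resources, deliberately listed in the wrong order.
resources = [
    {"kind": "Deployment",
     "metadata": {"name": "payment-api",
                  "annotations": {SYNC_WAVE_ANNOTATION: "1"}}},
    {"kind": "Certificate", "metadata": {"name": "api-cert"}},
    {"kind": "CustomResourceDefinition",
     "metadata": {"name": "certificates.cert-manager.io",
                  "annotations": {SYNC_WAVE_ANNOTATION: "-1"}}},
]

ordered_kinds = [r["kind"] for r in apply_order(resources)]
```

Placing the CRD in an earlier wave guarantees the API server knows the Certificate kind before any Certificate object is applied.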

Production checklist

GitOps controller installed with least-privilege RBAC (not cluster-admin unless unavoidable).
Protected branches on the manifest repo; required reviewers for production overlays.
Secrets encrypted (Sealed Secrets/SOPS) or referenced from external vault — zero plaintext in Git history.
Sync policy documented per environment: auto-sync in staging, manual sync in production, with production auto-sync and self-heal enabled only after burn-in.
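Prune enabled, with StatefulSets, PVCs, and namespaces excluded via the controller's no-prune annotation (the Prune=false sync option in Argo CD; kustomize.toolkit.fluxcd.io/prune: disabled in Flux).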
Notifications wired (Slack, PagerDuty) on sync failure, health degradation, and out-of-sync drift.
Rollback runbook tested: revert commit, verify controller syncs, confirm application health — including DB migration caveats.
CI pipeline updates image tags in Git (or opens promotion PR) but does not hold cluster credentials.
Observability: controller metrics scraped, sync duration and failure rate on a dashboard.
Bootstrap documented: how to rebuild the cluster from Git + Terraform in under one hour.