Infrastructure as code explained
Production goes down at 2 a.m. because someone opened the AWS console, widened
a security group to “fix it fast,” and never documented the change.
Staging matches production on paper but runs different instance types because
nobody remembers who clicked what six months ago.
Infrastructure as code (IaC) replaces ad-hoc console clicks with
versioned, reviewable definitions of servers, networks, load balancers, databases,
and IAM policies. Terraform is one of the most widely adopted IaC tools:
you describe desired state in HashiCorp Configuration Language (HCL), run
terraform plan to preview diffs, and terraform apply to reconcile the cloud
with your repo. This guide covers declarative vs imperative IaC, Terraform’s
core workflow, state management, modules, multi-environment patterns, CI/CD integration,
alternatives, anti-patterns, and how IaC connects to
CI/CD pipelines,
Kubernetes, and
secrets management.
Why manual infrastructure fails at scale
Small teams can survive on console clicks. Past a handful of services, three problems compound:
- Configuration drift — live cloud resources diverge from what anyone thinks is deployed; debugging becomes archaeology.
- No audit trail — console changes lack pull-request review, blame, or rollback semantics.
- Slow, error-prone reproduction — spinning up a new region or disaster-recovery environment means re-clicking hundreds of screens.
IaC treats infrastructure like application code: branches, diffs, tests, and automated promotion through environments. The goal is not “never touch the console” — it is that every durable change flows through the repo so staging and production stay explainable.
Declarative vs imperative IaC
Imperative tools (shell loops, early procedural Chef recipes) say how to reach a state: “install nginx, edit this file, restart the service.” Declarative tools (Terraform, CloudFormation, Pulumi) say what you want: “a load balancer on port 443 forwarding to these three instances.” The engine computes the changes needed to get from the current state to the desired one.
Declarative IaC wins for cloud APIs because providers already expose create/update/delete operations. You describe the end state; Terraform’s dependency graph orders creates and updates safely. Imperative scripts still help for one-off bootstrap tasks, but durable cloud topology belongs in declarative files.
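To make the "what, not how" idea concrete, here is a minimal sketch in HCL. The variable name ami_id, the instance type, and the tag format are illustrative assumptions, not values from this guide.

```hcl
# Hypothetical input: the caller supplies an AMI ID for the web tier.
variable "ami_id" {
  type        = string
  description = "AMI for the web instances (assumed to exist in the target region)"
}

# Desired state: exactly three web instances.
# There are no create/delete commands here; Terraform compares this
# declaration with its state and derives the operations itself.
resource "aws_instance" "web" {
  count         = 3
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "web-${count.index}"
  }
}
```

Changing count from 3 to 5 and running a plan should propose two creates and leave the existing three instances alone, which is the diff-driven behavior described above.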
Terraform core concepts
Providers, resources, and data sources
A provider plugin talks to an API — AWS, Google Cloud, Azure,
Cloudflare, GitHub, or
Kubernetes.
A resource is something Terraform manages (aws_instance,
google_cloud_run_service). A data source reads existing
infrastructure without owning it (look up a VPC ID, fetch a TLS certificate ARN).
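The three concepts can be sketched together as follows. The region, tag value, and CIDR range are assumptions chosen for illustration; the example presumes a VPC tagged shared-vpc already exists and is owned by someone else.

```hcl
# Provider: the plugin that talks to the AWS API.
provider "aws" {
  region = "us-east-1" # assumed region
}

# Data source: looks up an existing VPC by tag without managing it.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-vpc" # hypothetical tag on a pre-existing VPC
  }
}

# Resource: a subnet that Terraform creates, updates, and destroys.
resource "aws_subnet" "app" {
  vpc_id            = data.aws_vpc.shared.id
  cidr_block        = "10.0.10.0/24" # assumed to fall inside the VPC range
  availability_zone = "us-east-1a"

  tags = {
    Name = "app-subnet"
  }
}
```

Destroying this configuration removes the subnet but never touches the VPC, because the VPC is only read, not owned.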
Variables, outputs, and locals
Variables parameterize modules — region, instance size, environment
name. Outputs export values to other stacks or humans (load
balancer DNS name, database endpoint). Locals are computed
convenience values inside a module. Typed variables with validation blocks catch
misconfiguration before apply.
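A short sketch of a typed variable with a validation block, a local, and an output. The name prefix "shop" and the allowed environment names are assumptions for illustration.

```hcl
# Typed input with validation: a bad value fails at plan time, before any API call.
variable "environment" {
  type        = string
  description = "Deployment environment name"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

# Local: a computed convenience value used inside this module only.
locals {
  name_prefix = "shop-${var.environment}" # "shop" is a hypothetical project name
}

# Output: exported for other stacks or for humans reading the apply result.
output "name_prefix" {
  description = "Prefix callers can reuse when naming resources"
  value       = local.name_prefix
}
```

Passing environment = "production" here would be rejected with the error message above instead of producing oddly named resources.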
The plan / apply loop
- terraform init — download providers, configure backend.
- terraform plan — compute diff between desired config and state; shows creates, updates, destroys. Run this in CI on every PR.
- terraform apply — execute the plan after human or policy approval.
Never apply blind. A plan that destroys a production database because
someone renamed a resource is a classic Terraform incident — resource addresses
are identity keys; renames often imply destroy+create unless you use
moved blocks or terraform state mv.
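As a sketch of the rename case, suppose a database resource was previously declared as aws_db_instance.db and is being renamed to aws_db_instance.primary. The identifiers and sizing below are hypothetical, and the moved block requires Terraform 1.1 or later.

```hcl
# The resource under its new address. All values here are illustrative.
resource "aws_db_instance" "primary" {
  identifier                  = "orders-db"
  engine                      = "postgres"
  instance_class              = "db.t3.micro"
  allocated_storage           = 20
  username                    = "app"
  manage_master_user_password = true # let RDS keep the password in Secrets Manager
  skip_final_snapshot         = false
  final_snapshot_identifier   = "orders-db-final"
}

# Tells Terraform the old address and the new one are the same object,
# so the plan shows a move instead of destroy + create.
moved {
  from = aws_db_instance.db
  to   = aws_db_instance.primary
}
```

With the moved block in place, the plan should report the address change and no replacement. Without it, the same rename would typically be planned as one destroy and one create.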
State: the source of truth Terraform remembers
Terraform stores a state file mapping HCL addresses to real cloud
IDs. Without state, it cannot know that aws_instance.web is
i-0abc123 rather than a new machine to create.
Remote state and locking
Local terraform.tfstate on a laptop is fine for learning; production
teams use remote backends such as S3 with DynamoDB locking, Google Cloud
Storage (GCS), Terraform Cloud, or equivalent. Remote state enables:
- Team collaboration without emailing state files.
- State locking so two applies never race.
- Encrypted storage and versioning for rollback.
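A minimal S3 backend sketch with locking and encryption. The bucket name, key, and lock table name are hypothetical, and both the bucket and the table are assumed to exist already.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # hypothetical, pre-created bucket
    key            = "network/terraform.tfstate" # one key per stack
    region         = "us-east-1"
    encrypt        = true                        # server-side encryption of the state object
    dynamodb_table = "terraform-locks"           # hypothetical lock table
  }
}
```

Recent Terraform releases also offer S3-native lock files as an alternative to the DynamoDB table; check the backend documentation for the version you run before choosing. Versioning for rollback is a property of the bucket itself rather than of this block.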
Drift detection
Console edits create drift — state says one thing, reality another.
A scheduled, read-only terraform plan in CI alerts when drift appears; with -detailed-exitcode, exit code 2 signals pending changes.
Policy: either import the change into code or revert the manual edit. Letting drift
accumulate guarantees the next apply surprises someone.
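For the "import the change into code" path, an import block (Terraform 1.5 or later) can adopt a manually created resource. This is a sketch; the security group ID, VPC ID, and rule are hypothetical stand-ins for whatever was clicked in the console.

```hcl
# Adopt an existing, manually created security group into state.
import {
  to = aws_security_group.web
  id = "sg-0123456789abcdef0" # hypothetical ID of the console-created group
}

# The configuration that should describe it from now on.
resource "aws_security_group" "web" {
  name        = "web-sg"
  description = "Web tier ingress"
  vpc_id      = "vpc-0123456789abcdef0" # hypothetical VPC ID

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The next plan then shows any remaining differences between this block and the live resource, so the manual widening is either codified or reverted in review. The -generate-config-out option of terraform plan can draft the resource block when writing it by hand is impractical.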
State security
State often contains secrets in plaintext, such as database passwords. Marking a
value as sensitive only redacts it from CLI output; it is still written to
state. Treat state buckets like production databases: encryption at rest, tight
IAM, no public access. Prefer
referencing secrets from
Vault, AWS Secrets Manager,
or SSM rather than embedding them in .tfvars committed to git.
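The bucket hardening described above might look like this in HCL. The bucket name is hypothetical, and the choice of KMS encryption is an assumption; IAM policies restricting who can read the bucket are not shown.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "example-terraform-state" # hypothetical name
}

# Keep old state versions so a bad apply can be rolled back.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Block every form of public access.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Teams often manage this bucket in a small bootstrap stack of its own, since the stack that stores state cannot easily keep its own state in a bucket it has not created yet.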
Modules: reusable infrastructure packages
A module is a directory of .tf files with inputs and
outputs. Instead of copy-pasting fifty lines per environment, you call:
module "vpc" {
  source = "./modules/vpc"
  cidr   = var.vpc_cidr
  region = var.aws_region
}
Good module design principles:
- Single responsibility — one module for a VPC, another for an ECS service, not a god-module that provisions everything.
- Sensible defaults — expose only knobs callers need; hide internal wiring.
- Version pinning — reference public modules by git tag or registry version, not the master branch.
- Documented contracts — README listing required variables, outputs, and side effects (creates NAT gateways = recurring cost).
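Version pinning for a registry module can be sketched as below. The module shown is the public terraform-aws-modules VPC module; the name, CIDR ranges, and version constraint are assumptions for illustration.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # accept 5.x releases only; upgrade majors deliberately

  name = "shop-prod" # hypothetical
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # Side effect worth documenting: NAT gateways carry a recurring cost.
  enable_nat_gateway = true
  single_nat_gateway = true
}

# For a module in a git repository, pin a tag with ?ref=, for example:
#   source = "git::https://example.com/infra/modules.git//vpc?ref=v1.4.0"
# (hypothetical URL and tag)
```

The version argument only works for registry sources; git sources are pinned through the ref query parameter instead.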
Split large systems into stack boundaries: networking stack, data stack, application stack — each with its own state file. Smaller blast radius; networking changes do not lock the app stack during apply.
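One way to connect split stacks is for the application stack to read outputs that the networking stack publishes. The sketch below assumes the networking stack stores state in a hypothetical S3 bucket and exports an output named vpc_id.

```hcl
# Read-only view of the networking stack's outputs.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"   # hypothetical
    key    = "network/terraform.tfstate" # the networking stack's state key
    region = "us-east-1"
  }
}

# The app stack consumes the VPC ID without owning the VPC.
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id # assumed output name
}
```

The app stack needs read access to the networking state for this to work, so some teams prefer publishing such values through a parameter store to avoid exposing the whole state file.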
Environments: workspaces, directories, or separate accounts
Three common patterns for dev / staging / prod:
- Directory per environment — envs/dev, envs/prod each with its own backend key. Clearest isolation; duplicate config unless modules abstract it.
- Workspaces — one code tree, Terraform workspaces swap state namespaces. Convenient for small teams; easy to accidentally apply to wrong workspace without guardrails.
- Separate cloud accounts — AWS Organizations with dev and prod accounts. Strongest blast-radius isolation; IAM and provider aliases wire modules to the right account.
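For the separate-accounts pattern, a provider alias can point a module at the production account. This is a sketch: the account ID, role name, and CIDR are hypothetical, and ./modules/vpc is assumed to accept cidr and region inputs.

```hcl
# Aliased provider that assumes a role in the prod account.
provider "aws" {
  alias  = "prod"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-deploy" # hypothetical
  }
}

# The same module code, wired to the prod account through the alias.
module "network_prod" {
  source = "./modules/vpc"

  providers = {
    aws = aws.prod
  }

  cidr   = "10.20.0.0/16"
  region = "us-east-1"
}
```

Because credentials are resolved per provider block, a plan run with dev credentials cannot reach prod resources unless the role's trust policy allows it.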
Pair environment separation with promotion in
CI/CD: plan on PR,
apply to dev automatically, manual approval gate before prod apply. Tag resources
with Environment = prod for cost allocation and policy engines.
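With the AWS provider, tagging can be applied once at the provider level rather than on every resource. The owner and cost center values below are hypothetical.

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = "prod"
      Owner       = "platform-team" # hypothetical
      CostCenter  = "cc-1234"       # hypothetical
    }
  }
}
```

Tags set on an individual resource still take precedence over these defaults when the keys collide.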
IaC in the CI/CD pipeline
Mature teams run Terraform inside the same pipeline discipline as application deploys:
- fmt and validate on every commit — terraform fmt -check, terraform validate.
- Plan on pull request — post the diff as a PR comment; reviewers see exactly which resources change.
- Policy as code — Open Policy Agent (OPA), Sentinel, or Checkov scan plans for public S3 buckets, unencrypted disks, overly broad IAM.
- Apply on merge — only from CI with OIDC federation to cloud roles (no long-lived AWS keys in GitHub secrets).
- Smoke tests post-apply — curl health endpoints, verify DNS resolves, confirm autoscaler registered targets.
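The cloud side of the OIDC step can itself be expressed in Terraform. The sketch below assumes GitHub Actions as the CI system, an OIDC identity provider for it already registered in the AWS account, and a hypothetical repository example-org/infra; permissions attached to the role are not shown.

```hcl
# Look up the already-registered GitHub OIDC provider (assumed to exist).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only workflows on the main branch of one repo may assume the role.
data "aws_iam_policy_document" "ci_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra:ref:refs/heads/main"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci-apply" # hypothetical
  assume_role_policy = data.aws_iam_policy_document.ci_assume.json
}
```

Restricting the sub claim to the main branch means pull-request workflows cannot assume the apply role; a second, read-only role is a common companion for plan jobs.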
Coordinate IaC applies with application blue-green or canary deploys: infrastructure changes (new subnet, larger instance class) may need to land before or after the app rollout depending on compatibility.
Terraform vs alternatives
- AWS CloudFormation / CDK — native to AWS; CDK generates CloudFormation from TypeScript/Python. Excellent if you are AWS-only; weaker multi-cloud story.
- Pulumi — real programming languages (TypeScript, Go, Python) instead of HCL. Strong for teams who want loops, unit tests, and IDE refactoring on infrastructure.
- Crossplane / Kubernetes operators — manage cloud resources as Kubernetes CRDs. Fits GitOps-native shops already running K8s as the control plane.
- Ansible / shell — configuration management and bootstrap, not full declarative cloud topology. Often complements Terraform (Terraform provisions the VM; Ansible installs packages).
- Docker Compose — local/dev container stacks per our Docker fundamentals guide; graduate to Terraform + orchestrator for production multi-node systems.
Terraform’s advantage is provider breadth and a large module ecosystem. Its weakness is HCL’s limited abstraction compared to general-purpose languages — which is why Pulumi and CDK exist.
Common anti-patterns
- ClickOps fixes without code updates — drift accumulates until the next apply reverts the fix or changes resources nobody expected.
- Monolithic state file — one apply locks everything; a typo in a dev resource blocks prod changes.
- Secrets in git — .tfvars with API keys committed; use secret managers and CI-injected vars.
- No plan in CI — developers apply from laptops without review.
- Renaming resources carelessly — triggers destroy/recreate on stateful resources (RDS, EBS).
- Ignoring prevent_destroy — lifecycle guards on databases and state buckets turn an accidental destroy into a plan error; use them.
- Provisioning app config in Terraform — user data scripts that grow into undebuggable bash blobs; provision compute/networking in Terraform, configure apps via containers or config management.
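A lifecycle guard on a stateful resource can be sketched as follows. The table name and key are hypothetical; a DynamoDB table stands in for any database.

```hcl
resource "aws_dynamodb_table" "orders" {
  name         = "orders" # hypothetical
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "order_id"

  attribute {
    name = "order_id"
    type = "S"
  }

  # API-level guard: AWS itself refuses to delete the table.
  deletion_protection_enabled = true

  # Terraform-level guard: any plan that would destroy this resource fails.
  lifecycle {
    prevent_destroy = true
  }
}
```

The two guards cover different mistakes: prevent_destroy stops a careless rename or removal in code, while deletion protection also stops deletion from the console or CLI. Note that prevent_destroy stops protecting a resource once its whole block is deleted from the configuration.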
Production checklist
- Store state remotely with locking and encryption enabled.
- Run terraform plan on every PR; require human review for prod applies.
- Split stacks by blast radius (network, data, app).
- Pin provider and module versions; upgrade deliberately with plan diffs.
- Never commit secrets; integrate with a secrets manager.
- Tag all resources for environment, owner, and cost center.
- Schedule drift detection; reconcile or import manual changes promptly.
- Use OIDC for CI cloud authentication instead of static keys.
- Document rollback: state versioning, terraform apply of previous git tag, and which resources are safe to destroy.
- Pair IaC changes with application deploy strategy and database migration plans.
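The pinning item in the checklist can be sketched as a terraform block. The version constraints are illustrative assumptions; choose ones that match what your team has tested.

```hcl
terraform {
  # Pin the CLI to a tested minor series.
  required_version = "~> 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # accept 5.x at or above 5.40, never 6.x
    }
  }
}
```

Running terraform init records the exact provider versions and checksums in .terraform.lock.hcl; committing that file keeps laptops and CI on identical provider builds until someone upgrades on purpose.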
Key takeaways
- IaC makes cloud infrastructure versioned, reviewable, and reproducible.
- Terraform uses declarative HCL, providers, and state to compute minimal diffs.
- Remote state + locking are non-negotiable for teams; treat state as sensitive data.
- Modules and stack splits control complexity and blast radius.
- CI/CD integration — plan on PR, policy checks, gated apply — prevents ClickOps regression.
Related reading
- CI/CD pipelines — automate plan/apply gates, environment promotion, and rollback playbooks
- Kubernetes fundamentals — orchestrate containers on infrastructure Terraform provisions
- Secrets management — keep credentials out of state files and version control
- Docker fundamentals — container images and Compose for local dev before cloud deploy