
Infrastructure as code explained

Production goes down at 2 a.m. because someone opened the AWS console, widened a security group to “fix it fast,” and never documented the change. Staging matches production on paper but runs different instance types because nobody remembers who clicked what six months ago.

Infrastructure as code (IaC) replaces ad-hoc console clicks with versioned, reviewable definitions of servers, networks, load balancers, databases, and IAM policies. Terraform is among the most widely adopted IaC tools: you describe desired state in HashiCorp Configuration Language (HCL), run terraform plan to preview the diff, and run terraform apply to reconcile the cloud with your repo.

This guide covers declarative vs imperative IaC, Terraform’s core workflow, state management, modules, multi-environment patterns, CI/CD integration, alternatives, and anti-patterns, and notes where IaC touches Kubernetes and secrets management.

Why manual infrastructure fails at scale

Small teams can survive on console clicks. Past a handful of services, three problems compound:

Configuration drift — live cloud resources diverge from what anyone thinks is deployed; debugging becomes archaeology.
No audit trail — console changes lack pull-request review, blame, or rollback semantics.
Slow, error-prone reproduction — spinning up a new region or disaster-recovery environment means re-clicking hundreds of screens.

IaC treats infrastructure like application code: branches, diffs, tests, and automated promotion through environments. The goal is not “never touch the console” — it is that every durable change flows through the repo so staging and production stay explainable.

Declarative vs imperative IaC

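Imperative approaches (shell loops, hand-written SDK or CLI scripts, procedural Chef recipes) say how to reach a state: “install nginx, edit this file, restart the service.” Declarative tools (Terraform, CloudFormation, Pulumi) say what you want: “a load balancer on port 443 forwarding to these three instances.” The engine compares that description with what already exists and computes the changes needed to close the gap. Pulumi programs are written in general-purpose languages, but they still declare desired state rather than issue step-by-step API calls.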

Declarative IaC suits cloud APIs because providers already expose create, update, and delete operations for every resource. You describe the end state; Terraform builds a dependency graph from the references between resources and orders creates and updates safely. Imperative scripts still help for one-off bootstrap tasks, but durable cloud topology belongs in declarative files.
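
As a minimal sketch of the declarative style, the fragment below describes a listener on port 443 that forwards to a target group. The resource names and the three input variables are hypothetical; the point is that the listener references the target group's ARN, and that reference is what lets Terraform infer the creation order.

```hcl
# Inputs are assumptions for this sketch; a real stack would supply them.
variable "vpc_id" {
  type = string
}

variable "load_balancer_arn" {
  type = string
}

variable "certificate_arn" {
  type = string
}

resource "aws_lb_target_group" "web" {
  name     = "web"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = var.load_balancer_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"
    # This reference creates the dependency edge: target group first, then listener.
    target_group_arn = aws_lb_target_group.web.arn
  }
}
```

Nothing in the file says "create the target group, then the listener"; the order falls out of the reference.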

Terraform core concepts

Providers, resources, and data sources

A provider plugin talks to an API — AWS, Google Cloud, Azure, Cloudflare, GitHub, or Kubernetes. A resource is something Terraform manages (aws_instance, google_cloud_run_service). A data source reads existing infrastructure without owning it (look up a VPC ID, fetch a TLS certificate ARN).
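
The three concepts can be sketched side by side. The region, tag value, and names below are placeholders chosen for illustration, not values from any real environment.

```hcl
# Provider: configures how Terraform talks to the AWS API.
provider "aws" {
  region = "eu-west-1"
}

# Data source: looks up an existing VPC by tag. Terraform reads it but never
# creates, changes, or deletes it.
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-vpc" # hypothetical tag value
  }
}

# Resource: Terraform owns this security group for its whole lifecycle.
resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = data.aws_vpc.shared.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```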

Variables, outputs, and locals

Variables parameterize modules — region, instance size, environment name. Outputs export values to other stacks or humans (load balancer DNS name, database endpoint). Locals are computed convenience values inside a module. Typed variables with validation blocks catch misconfiguration before apply.
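
A short sketch of all three, with a validation block that rejects an unknown environment name at plan time. The variable names and the allowed values are assumptions for illustration.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment name."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of dev, staging, or prod."
  }
}

variable "instance_type" {
  type        = string
  description = "EC2 instance size for the service."
  default     = "t3.micro"
}

locals {
  # Computed once, reused wherever a resource needs a name.
  name_prefix = "app-${var.environment}"
}

output "name_prefix" {
  description = "Prefix applied to resource names in this module."
  value       = local.name_prefix
}
```

Passing environment = "production" fails validation before any API call is made.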

The plan / apply loop

terraform init: download providers and modules, configure the backend.
terraform plan: refresh state against the real infrastructure, then compute the diff between the configuration and that state; the output lists creates, updates, and destroys. Run this in CI on every PR.
terraform apply: execute the plan after human or policy approval.

Never apply blind. A plan that destroys a production database because someone renamed a resource is a classic Terraform incident. Resource addresses are identity keys, so a rename in HCL reads as “destroy the old address, create the new one” unless you record the move with a moved block or terraform state mv.
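
A rename can be recorded in code with a moved block (available since Terraform 1.1). In this sketch a volume previously declared as aws_ebs_volume.vol is renamed to aws_ebs_volume.data; both names and the volume settings are hypothetical.

```hcl
# The block used to be: resource "aws_ebs_volume" "vol" { ... }
resource "aws_ebs_volume" "data" {
  availability_zone = "eu-west-1a"
  size              = 100
}

# Tells Terraform the existing object keeps its identity under the new address,
# so the plan shows a move instead of destroy + create.
moved {
  from = aws_ebs_volume.vol
  to   = aws_ebs_volume.data
}
```

The plan should then report the address change with no resources added or destroyed; if it still shows a destroy, stop and investigate before applying.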

State: the source of truth Terraform remembers

Terraform stores a state file mapping HCL addresses to real cloud IDs. Without state, it cannot know that aws_instance.web is i-0abc123 rather than a new machine to create.

Remote state and locking

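Local terraform.tfstate on a laptop is fine for learning; production teams use a remote backend: S3 (with a DynamoDB lock table or, in newer Terraform releases, S3-native lock files), Google Cloud Storage (which locks natively), Terraform Cloud (now HCP Terraform), or equivalent. Remote state enables: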

Team collaboration without emailing state files.
State locking so two applies never race.
Encrypted storage and versioning for rollback.
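
A minimal sketch of an S3 backend with locking and encryption. The bucket, key, and table names are hypothetical, and the bucket and lock table are assumed to exist already (they are usually bootstrapped in a separate stack).

```hcl
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state" # hypothetical bucket name
    key     = "network/terraform.tfstate"   # one key per stack
    region  = "eu-west-1"
    encrypt = true

    # DynamoDB lock table. Newer Terraform releases also offer S3-native
    # locking (use_lockfile); check the backend docs for your version.
    dynamodb_table = "terraform-locks" # hypothetical table name
  }
}
```

Backend blocks cannot reference variables, which is why the values are literals here; per-environment values are typically supplied with terraform init -backend-config.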

Drift detection

Console edits create drift: state says one thing, reality another. A scheduled, read-only terraform plan in CI alerts when drift appears (with -detailed-exitcode, plan exits with code 2 when changes are pending). Policy: either bring the change into code (update the HCL, or import resources that were created by hand) or revert the manual edit. Letting drift accumulate guarantees the next apply surprises someone.
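
When drift turns out to be a resource someone created by hand, one way to adopt it is an import block (Terraform 1.5 and later). The security group ID, names, and the vpc_id variable below are placeholders for illustration.

```hcl
variable "vpc_id" {
  type = string
}

# Binds the hand-made security group to a resource address on the next apply.
import {
  to = aws_security_group.emergency
  id = "sg-0123456789abcdef0" # placeholder ID of the console-created group
}

# The configuration must describe the group as it should exist from now on.
resource "aws_security_group" "emergency" {
  name   = "emergency-access"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
}
```

The plan for this change shows an import plus any attribute differences between the live group and the HCL, which gives reviewers a chance to decide which side wins.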

State security

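State often contains secrets in plaintext: a database password passed to a resource is written to state even when the variable or output is marked sensitive, because sensitive only redacts CLI output. Treat state buckets like production databases: encryption at rest, tight IAM, no public access. Keep secret values in Vault, AWS Secrets Manager, or SSM Parameter Store rather than embedding them in .tfvars files committed to git, and remember that a secret read through a data source is also recorded in state.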
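
The bucket-level controls can themselves be expressed in Terraform. This is a sketch under the assumption that state lives in S3; the bucket name is hypothetical, and IAM policies restricting who can read the bucket are left out for brevity.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "example-org-terraform-state" # hypothetical bucket name
}

# Keep old state versions so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt every object at rest with KMS.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# Block every form of public access.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```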

Modules: reusable infrastructure packages

A module is a directory of .tf files with inputs and outputs. Instead of copy-pasting fifty lines per environment, you call:

# Call the local VPC module; var.vpc_cidr and var.aws_region are declared by the caller.
module "vpc" {
  source = "./modules/vpc"
  cidr   = var.vpc_cidr
  region = var.aws_region
}

Good module design principles:

Single responsibility: one module for a VPC, another for an ECS service, not a god-module that provisions everything.
Sensible defaults: expose only the knobs callers need; hide internal wiring.
Version pinning: reference shared modules by git tag or registry version, not by a moving branch such as main or master.
Documented contracts: a README listing required variables, outputs, and side effects (for example, creating NAT gateways incurs a recurring cost).
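
Version pinning looks slightly different for registry and git sources. The sketch below pins a public registry module with a version constraint and a git-hosted module with a tag; the git URL, the tag, and all input values are hypothetical.

```hcl
# Registry module: the version argument accepts a constraint.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "app-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # Side effect worth documenting: NAT gateways bill by the hour.
  enable_nat_gateway = true
  single_nat_gateway = true
}

# Git module: pin with ?ref=<tag>. The repository and tag are placeholders,
# and the module's inputs (omitted here) depend on how it is written.
module "service" {
  source = "git::https://example.com/example-org/terraform-modules.git//ecs-service?ref=v1.4.0"
}
```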

Split large systems along stack boundaries: a networking stack, a data stack, and an application stack, each with its own state file. The blast radius shrinks, and a networking apply does not hold the state lock for the app stack.
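
Separate stacks still need to share values such as a VPC ID. One common way is the terraform_remote_state data source, sketched here on the assumption that the networking stack stores its state in S3 and declares an output named vpc_id; the bucket and key are hypothetical.

```hcl
# Read-only view of the networking stack's outputs.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-org-terraform-state" # hypothetical bucket name
    key    = "network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# The application stack consumes the output without owning the VPC.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

The trade-off is that the reader needs access to the whole upstream state file, so teams with strict state permissions sometimes publish shared values through a parameter store instead.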

Environments: workspaces, directories, or separate accounts

Three common patterns for dev / staging / prod:

Directory per environment: envs/dev and envs/prod, each with its own backend key. Clearest isolation; configuration is duplicated unless modules abstract it.
Workspaces: one code tree, with Terraform workspaces swapping state namespaces. Convenient for small teams; easy to apply to the wrong workspace by accident without guardrails.
Separate cloud accounts: AWS Organizations with dev and prod accounts. Strongest blast-radius isolation; IAM roles and provider aliases wire modules to the right account.

Pair environment separation with promotion in CI/CD: plan on PR, apply to dev automatically, manual approval gate before prod apply. Tag resources with Environment = prod for cost allocation and policy engines.
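
The separate-accounts pattern and the tagging advice can be sketched with provider aliases. The account IDs, role name, and CIDR are placeholders, and the module call reuses the ./modules/vpc interface (cidr, region) shown earlier in this guide.

```hcl
provider "aws" {
  alias  = "dev"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform" # placeholder account
  }

  default_tags {
    tags = {
      Environment = "dev"
    }
  }
}

provider "aws" {
  alias  = "prod"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/terraform" # placeholder account
  }

  # Every taggable resource created through this provider gets the tag.
  default_tags {
    tags = {
      Environment = "prod"
    }
  }
}

# The providers map decides which account the module's resources land in.
module "vpc_prod" {
  source = "./modules/vpc"

  providers = {
    aws = aws.prod
  }

  cidr   = "10.20.0.0/16"
  region = "eu-west-1"
}
```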

IaC in the CI/CD pipeline

Mature teams run Terraform inside the same pipeline discipline as application deploys:

fmt and validate on every commit — terraform fmt -check, terraform validate.
Plan on pull request — post the diff as a PR comment; reviewers see exactly which resources change.
Policy as code — Open Policy Agent (OPA), Sentinel, or Checkov scan plans for public S3 buckets, unencrypted disks, overly broad IAM.
Apply on merge — only from CI with OIDC federation to cloud roles (no long-lived AWS keys in GitHub secrets).
Smoke tests post-apply — curl health endpoints, verify DNS resolves, confirm autoscaler registered targets.
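
The OIDC step in the list above can be sketched on the AWS side as a role that only a specific GitHub Actions workflow can assume. This assumes the GitHub OIDC identity provider is already registered in the account; the repository name and role name are hypothetical, and the permissions the role grants are left out.

```hcl
# Look up the already-registered GitHub Actions OIDC provider.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "ci_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows running on the main branch of this (hypothetical) repo.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infrastructure:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "terraform-ci"
  assume_role_policy = data.aws_iam_policy_document.ci_assume.json
}
```

Because the trust policy is scoped to one repository and branch, a pull request from a fork cannot obtain credentials for the apply role.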

Coordinate IaC applies with application blue-green or canary deploys: infrastructure changes (new subnet, larger instance class) may need to land before or after the app rollout depending on compatibility.

Terraform vs alternatives

AWS CloudFormation / CDK — native to AWS; CDK generates CloudFormation from TypeScript/Python. Excellent if you are AWS-only; weaker multi-cloud story.
Pulumi — real programming languages (TypeScript, Go, Python) instead of HCL. Strong for teams who want loops, unit tests, and IDE refactoring on infrastructure.
Crossplane / Kubernetes operators — manage cloud resources as Kubernetes CRDs. Fits GitOps-native shops already running K8s as the control plane.
Ansible / shell — configuration management and bootstrap, not full declarative cloud topology. Often complements Terraform (Terraform provisions the VM; Ansible installs packages).
Docker Compose — local/dev container stacks per our Docker fundamentals guide; graduate to Terraform + orchestrator for production multi-node systems.

Terraform’s advantage is provider breadth and a large module ecosystem. Its weakness is HCL’s limited abstraction compared to general-purpose languages — which is why Pulumi and CDK exist.

Common anti-patterns

ClickOps fixes without code updates: drift accumulates until the next apply silently reverts the fix or replaces a production resource.
Monolithic state file: one apply locks everything; a typo in a dev resource blocks prod changes.
Secrets in git: .tfvars files with API keys get committed; use secret managers and CI-injected variables instead.
No plan in CI: developers apply from laptops without review.
Renaming resources carelessly: triggers destroy/recreate on stateful resources (RDS, EBS).
Skipping prevent_destroy: a lifecycle guard on databases and state buckets turns an accidental destroy into a plan error; set it on anything that holds data.
Provisioning app config in Terraform: user data scripts grow into undebuggable bash blobs; provision compute and networking in Terraform, and configure apps via containers or config management.
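
Two of these guards fit in one short sketch: a lifecycle block that refuses to destroy a database, and a master password that RDS generates and stores in Secrets Manager instead of a value in .tfvars. The identifiers and sizes are placeholders, and manage_master_user_password assumes a reasonably recent AWS provider.

```hcl
resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app_admin"

  # RDS creates the password and keeps it in Secrets Manager, so no
  # password value appears in the configuration.
  manage_master_user_password = true

  storage_encrypted         = true
  final_snapshot_identifier = "app-db-final"

  lifecycle {
    # Any plan that would destroy this instance fails with an error.
    prevent_destroy = true
  }
}
```

The guard only works while the resource block stays in the configuration; deleting the whole block removes the guard with it, so plan review still matters.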

Production checklist

Store state remotely with locking and encryption enabled.
Run terraform plan on every PR; require human review for prod applies.
Split stacks by blast radius (network, data, app).
Pin provider and module versions; upgrade deliberately with plan diffs.
Never commit secrets; integrate with a secrets manager.
Tag all resources for environment, owner, and cost center.
Schedule drift detection; reconcile or import manual changes promptly.
Use OIDC for CI cloud authentication instead of static keys.
Document rollback: how to restore a previous state version, how to re-apply the previous git tag, and which resources are safe to destroy.
Pair IaC changes with application deploy strategy and database migration plans.
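
The pinning item in this checklist can be sketched as a single settings block. The version numbers are illustrative constraints, not recommendations.

```hcl
terraform {
  # Refuse to run on a CLI version the team has not tested.
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow 5.x updates, block the next major version
    }
  }
}
```

Committing the generated .terraform.lock.hcl file alongside this block pins the exact provider builds that CI and laptops resolve.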

Key takeaways

IaC makes cloud infrastructure versioned, reviewable, and reproducible.
Terraform uses declarative HCL, providers, and state to compute minimal diffs.
Remote state + locking are non-negotiable for teams; treat state as sensitive data.
Modules and stack splits control complexity and blast radius.
CI/CD integration — plan on PR, policy checks, gated apply — prevents ClickOps regression.