Guide

Database migrations explained

Harbor Supply's checkout API ran fine in staging. Production deploy at 2 a.m. added a fulfillment_status column with NOT NULL and no default — PostgreSQL rewrote the entire orders table under an exclusive lock while eight API pods still served traffic against the old schema. Checkout returned 500s for forty minutes. The fix was not “run migrations faster”; it was treating database migrations as a first-class release artifact: versioned, reviewed, ordered with application deploys, and designed for zero-downtime when tables are large. Migrations are how teams evolve schema — new columns, indexes, constraints, partitions — without losing data or taking unplanned outages. They differ from ad-hoc SQL in production because every change is tracked, replayable on fresh environments, and auditable. This guide covers migration fundamentals, forward-only vs reversible scripts, the expand-contract pattern, tool ecosystems, a Harbor Supply order-schema worked example, an approach decision table, common pitfalls, and a production checklist. Pair migrations with transaction and isolation knowledge and PostgreSQL internals when you need to reason about locks and MVCC during DDL.

What a database migration is

A migration is a numbered, ordered change to database structure or reference data. Each step is typically a SQL file (or generated SQL from an ORM) with an up direction that applies the change and sometimes a down direction that reverses it. A migration history table (e.g. schema_migrations, _prisma_migrations, flyway_schema_history) records which versions ran on which database so the same sequence can be replayed on developer laptops, CI databases, staging, and production.

Migrations solve three problems manual SQL does not. First, reproducibility: a new engineer clones the repo and gets an identical schema by running the chain from empty to current. Second, coordination: application code and schema version move together — deploy v2.4 only after migration 20240610_add_status ran. Third, reviewability: DDL diffs go through pull request review like application code, catching full-table locks and destructive drops before they hit production. Migrations are not a substitute for backups or disaster recovery; they are the controlled path for intentional schema evolution.

Versioning and ordering rules

Every migration needs a unique, monotonically increasing identifier. Common schemes: timestamp prefixes (20240610120000_add_index), sequential integers (V14__add_orders_index.sql), or content hashes. Never reuse or reorder applied migrations — once migration 14 ran in production, editing file 14 breaks every environment that already recorded it. If you made a mistake, ship migration 15 that corrects it.

Immutability is the golden rule: applied migrations are append-only history. Teams that rewrite old files eventually fork dev and prod histories, requiring painful manual repair. Squashing is allowed only before any shared environment has applied the squashed range — typically early in a greenfield project, not on a live orders table with five years of history.

Run migrations in CI against ephemeral databases on every pull request. Fail the build if migration N applies but N+1 fails, if generated SQL differs from committed files, or if a migration takes longer than your lock budget on a representative data volume.

Forward-only vs reversible migrations

Reversible (up/down) migrations let you roll back schema by running down scripts. They work well in early development and for additive changes (drop column in down after add column in up). In production, true reversibility is rare: you cannot un-drop a column if new rows wrote data into it, and dropping an index that query plans now depend on may cause sudden full scans.

Mature teams default to forward-only migrations plus application rollback. If deploy v2.5 is bad, roll the app back to v2.4 — but leave schema at the v2.5 level if the new column is nullable and ignored by old code. Forward-only pairs naturally with expand-contract (below). Document “rollback plan” as “redeploy previous app binary” plus “feature flag off,” not “run down.sql on prod Friday night.”

Zero-downtime: the expand-contract pattern

Breaking schema changes — rename column, change type, add NOT NULL, split table — need multiple deploy steps so old and new application versions coexist during rolling updates. The expand-contract (or expand-migrate-contract) pattern splits work across releases:

  1. Expand. Add the new column/table/index in a backward-compatible way (nullable column, new table alongside old). Deploy migration; old code ignores the addition.
  2. Migrate data. Backfill in batches with UPDATE ... WHERE id BETWEEN or background jobs. Avoid one transaction that locks millions of rows. Use CONCURRENTLY for index creation in PostgreSQL.
  3. Dual-write / dual-read. Application writes to both old and new fields; reads prefer new with fallback. Verify consistency with reconciliation jobs.
  4. Contract. After all pods run new code, migration removes the old column, tightens constraints, or drops deprecated objects.

Renaming status to fulfillment_status without downtime: add fulfillment_status, backfill from status, deploy code that writes both, switch reads, deploy code that writes only new column, drop status in a final migration. Skipping steps causes the Harbor Supply outage pattern: schema and code changed in a single atomic deploy on a locked table.

DDL operations and lock awareness

Not all DDL is equal. Adding a nullable column with no default is fast on PostgreSQL 11+ (metadata-only). Adding NOT NULL without a default forces a full table rewrite and exclusive lock. Creating an index without CONCURRENTLY blocks writes. Foreign keys validated immediately lock both tables.

Before shipping migration SQL, ask: what lock mode does this take, how long on our largest table, and can we run it online? Use pg_stat_activity and lock_timeout in maintenance windows for unavoidable exclusive operations. For index design, plan indexes in migrations separate from column adds so you can create them concurrently and monitor bloat.

Migration tools and ORM integration

Tools fall into SQL-first and schema-first families. Flyway and Liquibase run versioned SQL or XML changelogs; popular in Java and polyglot shops. Alembic pairs with SQLAlchemy in Python. golang-migrate and goose serve Go services. Prisma Migrate, Drizzle Kit, and Rails Active Record generate SQL from declarative schema files — convenient for TypeScript and Ruby teams but requiring discipline to hand-edit generated SQL for production-safe DDL.

ORM-generated migrations excel at developer velocity; SQL-first tools excel when DBAs own review. Hybrid teams generate drafts, then curate SQL for CONCURRENTLY, batch backfills, and comments. Whichever tool you pick, store migrations in git beside application code and run them from CI/CD with the same promotion gates as binaries.

Harbor Supply order schema evolution (worked example)

Harbor Supply needed to split a single status enum (“pending”, “shipped”, “cancelled”) into separate payment_status and fulfillment_status fields for a new warehouse integration. The wrong approach — one migration renaming and retyping under lock — caused their outage. The corrected plan:

  1. Migration 041 (expand). ALTER TABLE orders ADD COLUMN payment_status text, ADD COLUMN fulfillment_status text; Both nullable. Deploy; v3.2 app unchanged.
  2. Backfill job. Batch update 10k rows at a time: map old status to both new columns. Run off-peak; monitor replication lag.
  3. Deploy v3.3. Write both old and new columns on every order update; read from new columns with fallback to status.
  4. Deploy v3.4. Read/write only new columns.
  5. Migration 042 (contract). Drop status; add check constraints on allowed enum values; add partial index on fulfillment_status where not terminal.

Total calendar time: two weeks. Zero checkout downtime. The team added a migration review checklist to every PR touching prisma/migrations/ and required staging load tests on a snapshot with production row counts.

Approach decision table

Scenario Recommended approach Avoid
New nullable column, app can ignore it Single forward migration; deploy app after Blocking NOT NULL in same release
Rename or retype column on large table Expand-contract across 2–4 releases ALTER ... RENAME during traffic
New index on 100M-row table CREATE INDEX CONCURRENTLY in dedicated migration Default index creation inside transaction
Seed reference data (countries, roles) Idempotent INSERT ... ON CONFLICT in migration Destructive DELETE then INSERT without guard
Destructive drop column/table Contract phase after telemetry proves zero reads Dropping before all app versions migrated
Multi-service shared database (legacy) Coordinate migration order; feature flags per service Assuming one deploy owns all readers/writers

Common pitfalls

  • Editing applied migrations. Forks history; always append a fix-forward migration.
  • Schema ahead of code. New NOT NULL column deploys before app writes it — insert failures.
  • Code ahead of schema. App queries new column before migration runs — column missing errors.
  • Long transactions in migrations. Backfills in one TX hold locks and fill WAL; batch instead.
  • Missing lock timeouts. A stuck migration blocks all writers until someone kills the session.
  • Environment drift. Manual hotfix SQL in prod not captured in git — staging never matches prod.
  • Data loss on down. Down migration drops column with production data; forward-only plus backups is safer.
  • Ignoring replication. Heavy DDL on primary stalls replicas; watch lag during backfills.

Production checklist

  • Every schema change lives in a versioned migration file committed to git.
  • CI applies full migration chain on ephemeral DB for each PR.
  • Review checklist covers lock mode, estimated duration, and rollback (app version, not down.sql).
  • Large-table changes use expand-contract or online DDL (CONCURRENTLY).
  • Backfills are batched with progress metrics and replication lag alerts.
  • Migration runs before or during deploy per documented order; document which.
  • Staging migration tested on snapshot with production-scale row counts.
  • Backup or PITR verified before irreversible contract-phase migrations.
  • Migration history table is never hand-edited except documented disaster recovery.
  • Post-deploy monitors query errors and slow queries tied to new indexes or constraints.

Key takeaways

  • Migrations are release artifacts. Schema and application version must advance in a planned sequence.
  • Immutability preserves trust. Append fix-forward migrations; never rewrite applied history.
  • Zero-downtime is a design choice. Expand-contract turns risky DDL into boring, incremental steps.
  • Locks matter more than syntax. Understand what PostgreSQL does on your largest table before merge.
  • Forward-only plus app rollback beats reversible down scripts in real production incidents.

Related reading