Guide

Database migration strategies explained

Application code ships several times a day; the database schema often lags behind because a bad migration can take down production faster than any bug in your API. A database migration is a controlled, versioned change to schema or data — adding a column, splitting a table, backfilling a default, or dropping an index. The hard part is not writing ALTER TABLE; it is coordinating schema changes with running traffic, long-running transactions, and replicas that apply changes at different speeds. This guide covers the strategies teams use to evolve Postgres, MySQL, and other relational stores without maintenance windows: versioned migration files, the expand-contract pattern, batched backfills, locking awareness, and rollback discipline tied into your CI/CD pipeline.

Versioned migrations and migration tools

Treat schema history like source code. Each change is an immutable file with a monotonic version number or timestamp — V001__create_users.sql, 20260607_add_email_index.py — checked into git and applied in order. Tools such as Flyway, Liquibase, Alembic (Python/SQLAlchemy), Rails migrations, and Prisma Migrate track which versions ran against each environment in a schema_migrations table so you never apply the same change twice.

Good migration hygiene: one logical change per file, idempotent guards where the engine allows (IF NOT EXISTS), and never editing a migration that already ran in production — append a new file instead. Pair migrations with application deploys using the same idempotency mindset you use for API retries: if a deploy restarts mid-migration, the tool should resume safely. Run migrations in CI against a disposable database cloned from recent schema so broken SQL fails before merge, not at 2 a.m. on Friday.

The expand-contract pattern for zero-downtime changes

The most common outage pattern is renaming or retyping a column in a single deploy: old app code reads user_name, new schema has display_name, and every write 500s until versions align. Expand-contract (also called parallel change) splits risky edits into three phases across multiple deploys:

  1. Expand — add the new column, table, or index alongside the old one. Deploy application code that writes to both but still reads from the old field. No consumer breaks because the old path still works.
  2. Migrate — backfill data from old to new in batches (see below). Turn on dual-read or switch reads to the new field once backfill coverage is complete and verified.
  3. Contract — remove writes to the old field, then drop the old column or table in a later migration after you are confident no rollback is needed.

Renaming status to order_state, splitting a JSON blob into normalized columns, or changing an enum all follow the same rhythm: never break readers and writers in the same release. Feature flags can gate which code path is active during the migrate phase so you can roll back application logic without reversing schema.

Data backfills without melting the database

Schema migrations are often fast; data migrations are not. Populating a new NOT NULL column for 400 million rows with a single UPDATE users SET tier = 'free' can hold row locks, fill the WAL, and stall replicas. Instead:

  • Batch by primary key — update 5,000–50,000 rows per transaction with WHERE id > ? ORDER BY id LIMIT N, sleep briefly between batches, and monitor replication lag.
  • Add nullable first, backfill, then enforce — add the column nullable, backfill in jobs, add a CHECK or NOT NULL constraint only when null count hits zero.
  • Use triggers or generated columns temporarily — keep old and new in sync during transition; drop the trigger in the contract phase.
  • Run heavy backfills off the critical path — dedicated worker with rate limits, not inside the deploy hook that blocks app startup.

Track backfill progress in a metrics dashboard: rows remaining, batch duration, replica lag. Abort or slow down if lag exceeds SLO. For tables under shard boundaries, backfill shard-by-shard so one hot partition does not dominate.

Locking, ALTER TABLE, and online DDL

Not all DDL is equal. Adding a nullable column without a default is often cheap on modern Postgres; adding a NOT NULL column with a volatile default may rewrite the entire table. Creating an index can block writes unless you use CONCURRENTLY (Postgres) or ALGORITHM=INPLACE, LOCK=NONE (MySQL InnoDB, when supported). Dropping a column referenced by a view, dropping a heavily used index, or changing column type can lock or rebuild for minutes on large tables.

Before production: run the migration against a staging copy with comparable row counts and measure lock wait time. Use pg_stat_activity or performance_schema to see what blocks. Schedule risky DDL in low-traffic windows even when using online algorithms — replicas still apply the change and may lag. Long DDL interacts badly with open transactions; document maximum transaction duration and connection pool behavior so migrations do not wait behind idle-in-transaction sessions. Understanding isolation levels and MVCC helps explain why a seemingly small change still blocks readers on some engines.

Rollback: forward-only vs reversible migrations

Frameworks often generate up and down methods. In production, down migrations are frequently untested or unsafe — especially after data-transform steps. Mature teams prefer forward-only migrations: if deploy fails, fix forward with a new migration rather than reversing DDL that may destroy data.

When rollback is required, prefer application rollback (revert the binary, keep schema expanded) over schema rollback (drop column). If you must reverse DDL, ensure the down path was exercised in staging with production-scale data. Destructive operations — drop column, drop table, shrink type — belong in the contract phase weeks after expand, with backups and verified replica lag at zero. Point-in-time recovery and logical backups are the real rollback for catastrophic mistakes, not a generated down() stub.

Multi-service, replicas, and sharded deployments

In a database-per-service layout, each service owns its migrations; no shared migration folder across teams. Cross-service schema coupling is an anti-pattern — use APIs or events instead of foreign keys across databases. When several services still share one legacy database (common during strangler migrations), coordinate expand-contract across service deploy order: expand schema first, deploy all writers to dual-write, then migrate reads, then contract once every service is updated.

Read replicas apply the same WAL or binlog as the primary. A heavy migration on the primary can leave replicas minutes behind; route critical reads accordingly or pause traffic to stale replicas until caught up. On sharded clusters, run the same migration on every shard with automation; a missed shard causes subtle split-brain bugs. For global sequences or uniqueness across shards, schema changes that add global constraints may require redesign rather than a simple ALTER.

Integrating migrations into CI/CD

Typical safe order: run migrations before or with the new app version that depends on them, never after old code is gone if the migration is expand-only. Many pipelines use a dedicated migration job with advisory locks so two deploys do not race. Blue-green deploys can migrate the idle color first, warm it, then switch traffic — but both colors must tolerate the schema during dual-write phases.

Block deploy on migration failure; alert on migration duration SLO breach. Store migration logs with deploy IDs for audit. For serverless or auto-scaling apps, ensure migration runs once per deploy, not once per new instance. Document which migrations require human approval (drops, table rewrites, backfills over N rows). The same pipeline that runs unit tests should fail if migration files conflict on version numbers or if drift is detected between ORM models and live schema.

Practical checklist

  • Classify the change — expand, data backfill, or contract; estimate row count and lock risk before writing SQL.
  • One migration per logical step — do not combine expand and contract in a single release.
  • Test on production-sized data — staging with 1,000 rows lies about lock time.
  • Batch backfills — rate-limit, monitor replica lag, resume safely.
  • Prefer online DDLCREATE INDEX CONCURRENTLY, inplace algorithms, avoid full table rewrites in peak traffic.
  • Plan rollback as forward fix — keep expanded columns until stable; avoid untested down migrations.
  • Coordinate multi-service deploys — dual-write windows documented in the runbook.
  • Backup and PITR verified — before contract-phase drops.

Key takeaways

  • Schema evolution is a deploy problem, not just a SQL problem — coordinate with application versions.
  • Expand-contract is the default pattern for breaking changes without downtime.
  • Data backfills dominate runtime — batch, throttle, and measure replica lag.
  • DDL locking varies by engine and table size — always measure on realistic data.
  • Forward-only migrations with backups beat automatic down scripts in production.
  • Sharded and replicated topologies multiply the blast radius of a missed or slow migration.

Related reading