
Change data capture (CDC) explained

A user updates their profile in your Postgres database. Elasticsearch still shows the old display name. Your analytics warehouse missed the change because the nightly ETL ran before the edit. Your Redis cache serves stale data until TTL expires. The root problem is the same: multiple systems need to reflect what lives in your OLTP database, and ad-hoc sync scripts always fall behind. Change data capture (CDC) solves this by reading the database's own transaction log — the write-ahead log (WAL) in Postgres, the binlog in MySQL — and streaming every row-level insert, update, and delete to downstream consumers in near real time. This guide covers CDC patterns, how tools like Debezium work, when CDC beats the transactional outbox, ordering and schema evolution pitfalls, and a checklist before you ship a log-based pipeline into production.

What change data capture is

CDC is the practice of identifying and delivering row-level changes from a source database to one or more targets. Each event typically carries the operation type (INSERT, UPDATE, DELETE), the affected table, primary-key values, and either the new row image, the old image, or both. Downstream systems — search indexes, data warehouses, cache invalidation workers, audit logs — apply those events without querying the source on every change.

CDC is not the same as database replication. Replication ships physical WAL records or logical row changes to another database server, typically a standby or replica that answers SQL queries; CDC emits change events that arbitrary consumers can transform. You might replicate Postgres to a read replica and run CDC from the same WAL to feed Kafka; the purposes differ even when the underlying log is shared.

CDC approaches compared

Teams reach for CDC through several mechanisms. Each has different latency, load, and correctness trade-offs.

Timestamp polling

A cron job runs SELECT * FROM orders WHERE updated_at > :watermark every minute. Simple to build, but it misses hard deletes (the row is gone), collapses several updates between two polls into the last one, depends on every writer maintaining updated_at, and delivers batch latency. Without an index on updated_at, every poll scans the whole table, so read load grows with table size rather than with change volume.
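The polling loop and its blind spot for hard deletes can be sketched as follows. This is a minimal illustration that assumes SQLite as a stand-in database, integer timestamps, and a hypothetical poll_changes helper; none of these come from a specific tool.

```python
import sqlite3

def poll_changes(conn, watermark):
    """Return rows changed since `watermark`, plus the advanced watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
# The index keeps each poll proportional to the number of changed rows.
conn.execute("CREATE INDEX orders_updated_at ON orders (updated_at)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "new", 100), (2, "new", 101)]
)

first_batch, watermark = poll_changes(conn, 0)

# Between polls: one row is updated, another is hard-deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 105 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

second_batch, watermark = poll_changes(conn, watermark)
# second_batch contains only order 1; nothing tells the consumer order 2 is gone.
```

The second poll returns the updated row but carries no trace of the deleted one, which is why polling pipelines usually pair with soft deletes or a periodic full reconciliation.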

Database triggers

An AFTER INSERT OR UPDATE OR DELETE trigger writes to a shadow change_log table or calls an external queue. Triggers run inside the transaction — they add write latency and can deadlock under load. They also couple change capture to schema migrations: every new table needs trigger maintenance.
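A small sketch of trigger-based capture, assuming SQLite as a stand-in and hypothetical users and change_log tables. SQLite needs one trigger per operation; in Postgres a single trigger function can usually cover all three.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, user_id INTEGER, old_name TEXT, new_name TEXT
);
-- Each trigger runs inside the writing transaction, which is where
-- the extra write latency comes from.
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO change_log (op, user_id, old_name, new_name)
    VALUES ('INSERT', NEW.id, NULL, NEW.name);
END;
CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO change_log (op, user_id, old_name, new_name)
    VALUES ('UPDATE', NEW.id, OLD.name, NEW.name);
END;
CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO change_log (op, user_id, old_name, new_name)
    VALUES ('DELETE', OLD.id, OLD.name, NULL);
END;
""")

conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("UPDATE users SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")

log = conn.execute(
    "SELECT op, user_id, old_name, new_name FROM change_log ORDER BY seq"
).fetchall()
```

Unlike polling, the shadow table records the delete and both intermediate states, at the cost of three extra writes on the OLTP path.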

Application-level events

Your service publishes a domain event after every write. Clean when you control all writers, but any code path that bypasses the service — a migration script, a support tool, a second microservice — silently skips events. The transactional outbox fixes atomicity by writing events in the same DB transaction as the business row, then relaying them asynchronously.

Log-based CDC

A dedicated connector reads the database's replication log as a logical subscriber. Postgres logical decoding, MySQL binlog in ROW format, SQL Server CDC tables — the engine already records every committed change. Log-based CDC adds minimal overhead to the OLTP path, captures all writers including ad-hoc SQL, and streams continuously rather than polling. This is what most teams mean when they say "CDC" in production.

How log-based CDC works

When your application commits a transaction, the database has already written the change records to its WAL or binlog before acknowledging success. A CDC connector reads that stream through a logical replication slot (Postgres) or by connecting as a binlog replica (MySQL). For each committed change, it decodes the row images, serializes them, often as JSON or Avro, and publishes to a message bus like Kafka, Amazon Kinesis, or Google Pub/Sub.
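To make the decoded output concrete, here is a sketch of a consumer applying change events to an in-memory mirror. It assumes a Debezium-style envelope (an op code of c, r, u, or d with before and after row images); the users table, its columns, and the apply_change helper are hypothetical.

```python
def apply_change(target: dict, event: dict) -> None:
    """Apply one change event to an in-memory mirror keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):            # create, snapshot read, update
        row = event["after"]
        target[row["id"]] = row          # upsert, so replaying an event is harmless
    elif op == "d":                      # a delete carries the old row in "before"
        target.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")

# Illustrative events; real payloads also carry source metadata and timestamps.
events = [
    {"op": "c", "before": None, "after": {"id": 7, "display_name": "sam"}},
    {"op": "u", "before": {"id": 7, "display_name": "sam"},
     "after": {"id": 7, "display_name": "Sam R."}},
    {"op": "d", "before": {"id": 7, "display_name": "Sam R."}, "after": None},
]

mirror = {}
snapshots = []
for event in events:
    apply_change(mirror, event)
    snapshots.append(dict(mirror))       # copy the mirror state after each event
```

Whether the before image is populated on updates depends on source settings (in Postgres, the table's replica identity), so a consumer should not rely on it unless that is configured.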

Debezium is the most widely deployed open-source CDC stack. It runs as Kafka Connect source connectors: one connector per database instance, emitting topics named server.schema.table. Consumers subscribe to the topics they need, apply transforms (flatten nested JSON, mask PII), and write to Elasticsearch, Snowflake, or in-memory caches. Debezium tracks its position in the log; on restart it resumes from the last committed offset, so brief connector outages do not lose data as long as the replication slot is retained.
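As an illustration, a Debezium Postgres connector registration might look like the payload built below, which would be sent to the Kafka Connect REST API (POST /connectors). The property names follow Debezium 2.x as far as I know; the host, database, slot, and table names are placeholders, and the topic_name helper is hypothetical.

```python
import json

connector = {
    "name": "app-postgres-cdc",          # placeholder connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",       # Postgres built-in logical decoding plugin
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "REPLACE_ME",   # inject from a secret store in practice
        "database.dbname": "appdb",
        "topic.prefix": "app",           # the "server" part of server.schema.table
        "slot.name": "debezium_app",
        "table.include.list": "public.users,public.orders",
    },
}

def topic_name(prefix: str, qualified_table: str) -> str:
    """Topic a captured table is expected to land in: prefix.schema.table."""
    return f"{prefix}.{qualified_table}"

config = connector["config"]
topics = [
    topic_name(config["topic.prefix"], table)
    for table in config["table.include.list"].split(",")
]
payload = json.dumps(connector, indent=2)    # request body for Kafka Connect
```

Check the property names against the Debezium version you deploy; older releases used a different name for the topic prefix.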

Other tools fill similar roles: AWS DMS for managed migration and ongoing replication, Maxwell's daemon for MySQL-to-Kafka, LinkedIn's Brooklin, and cloud-native options like Google Datastream. The architecture pattern is consistent — log tail, decode, publish — even when the transport differs.

Common CDC use cases

  • Search indexing — Keep Elasticsearch or OpenSearch in sync with product catalog or user profile tables without dual writes from the application tier.
  • Cache invalidation — A lightweight consumer listens for updates to hot keys and deletes or refreshes Redis entries within seconds instead of waiting for TTL.
  • Analytics and warehousing — Stream OLTP changes into BigQuery, Snowflake, or Redshift for near-real-time dashboards instead of nightly batch dumps.
  • Cross-service propagation — In an event-driven architecture, CDC turns any legacy table into an event source without rewriting every stored procedure.
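  • Audit and compliance — An append-only change stream provides a history of what changed and when. Recording who made the change requires the table or application to store that identity, because the log carries row data rather than end-user context.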
  • Read-model projection — CQRS write models stay in Postgres; CDC builds denormalized read models in document stores or graph databases.

CDC vs transactional outbox

Both solve "publish an event when data changes," but from opposite directions. The outbox is application-initiated: your code writes a business row and an outbox row in one transaction; a relay polls or tails the outbox table and publishes. You control event shape, versioning, and which changes are worth emitting. CDC is database-initiated: every committed row change becomes an event whether or not application code anticipated it.
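The outbox side of this comparison can be sketched in a few lines. The example assumes SQLite as a stand-in, hypothetical orders and outbox tables, and a polling relay with a caller-supplied publish function.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT, payload TEXT, published INTEGER DEFAULT 0
);
""")
conn.execute("INSERT INTO orders VALUES (1, 'paid')")
conn.commit()

def ship_order(conn, order_id):
    """Write the business row and the event row in one transaction."""
    with conn:  # commits both statements, or rolls both back on error
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderShipped", json.dumps({"order_id": order_id})),
        )

def relay_once(conn, publish):
    """Publish unpublished outbox rows in insertion order; return the count."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:  # mark only after publishing: at-least-once delivery
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)

published = []
ship_order(conn, 1)
first_run = relay_once(conn, lambda kind, body: published.append((kind, body)))
second_run = relay_once(conn, lambda kind, body: published.append((kind, body)))
```

Because the relay marks a row only after publishing, a crash between the two steps republishes the event, so outbox consumers need the same idempotency as CDC consumers.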

Choose outbox when you need rich domain events ("OrderShipped" with computed fields) and every writer goes through your service layer. Choose CDC when you must capture all changes including migrations and admin scripts, when retrofitting event publishing into a brownfield monolith is impractical, or when the consumer wants raw table mirrors rather than curated domain language. Many mature systems use both: CDC for analytics and search sync, outbox for inter-service sagas that require business semantics.

Ordering, delivery, and exactly-once

Kafka preserves order only within a partition. Debezium keys each event by the row's primary key by default, so all changes to one row land in the same partition of that table's topic and arrive in commit order. Ordering across different rows or across tables is not guaranteed, because those events can sit in different partitions and topics; if a consumer needs one aggregate's changes in order across several tables, route them to a single partition under a shared key. Consumers must also be idempotent: network retries and connector restarts can deliver duplicates. Use primary-key upserts in Elasticsearch, or a deduplication table keyed by (table, pk, lsn).
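The effect of keying by primary key can be shown with a toy partitioner. This sketch uses CRC32 purely as a stand-in hash (Kafka's default partitioner uses a different hash function), and the key format and partition count are assumptions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 6  # assumed topic partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Changes in source commit order, keyed by table and primary key.
changes = [
    ("users:42", "v1"), ("users:7", "v1"), ("users:42", "v2"),
    ("users:42", "v3"), ("users:7", "v2"),
]

partitions = defaultdict(list)
for key, version in changes:
    partitions[partition_for(key)].append((key, version))

def versions_seen(key: str) -> list:
    """Versions of one key in the order a partition consumer would read them."""
    return [v for k, v in partitions[partition_for(key)] if k == key]
```

Every version of users:42 lands in one partition in commit order, while the relative order of users:42 and users:7 is only preserved if they happen to hash to the same partition.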

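True exactly-once end-to-end (each source commit producing its downstream side effect exactly once) requires either a sink that commits its write and its consumer position atomically, for example by storing the offset in the same database transaction as the row, or careful idempotency design. Kafka transactions on their own cover Kafka-to-Kafka processing and do not extend to an external store. Most teams accept at-least-once delivery with idempotent handlers, which is simpler and sufficient for search indexes and caches.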
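One way to build an idempotent handler is to remember the highest log position applied per row and skip anything at or below it. The sketch below assumes each event carries a monotonically increasing LSN; the IdempotentSink class is hypothetical and keeps state in memory, whereas a real sink would persist the row and the LSN in the same transaction.

```python
class IdempotentSink:
    """Applies each (table, pk) change at most once, ordered by LSN."""

    def __init__(self):
        self.rows = {}          # (table, pk) -> current row image
        self.applied_lsn = {}   # (table, pk) -> highest LSN already applied

    def handle(self, table, pk, lsn, row):
        """Apply the change and return True, or return False for a duplicate."""
        key = (table, pk)
        if lsn <= self.applied_lsn.get(key, -1):
            return False        # duplicate or stale replay: skip
        if row is None:         # None marks a delete
            self.rows.pop(key, None)
        else:
            self.rows[key] = row
        self.applied_lsn[key] = lsn
        return True

sink = IdempotentSink()
results = [
    sink.handle("users", 42, 100, {"name": "a"}),   # first delivery
    sink.handle("users", 42, 100, {"name": "a"}),   # redelivered after a retry
    sink.handle("users", 42, 120, {"name": "b"}),   # newer change
    sink.handle("users", 42, 100, {"name": "a"}),   # replay after a restart
]
state_after_updates = dict(sink.rows)
deleted = sink.handle("users", 42, 130, None)
```

The stale replay at LSN 100 is ignored, so the row keeps the value written at LSN 120 regardless of how often earlier events are redelivered.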

Deletes and tombstones

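Hard deletes are easy to mishandle. A CDC DELETE event must reach each sink as an actual delete: an explicit delete API call in Elasticsearch, or, on log-compacted Kafka topics, a tombstone (a null value with the same key) so that compaction eventually drops the key. Debezium by default emits the delete event followed by such a tombstone. Soft deletes (deleted_at column) appear as UPDATE events, so consumers must interpret the flag. Document which tables use which pattern before wiring consumers.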
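A consumer can centralize these three cases in one decision function. This sketch assumes Debezium-style record values (None for a tombstone, an op field otherwise); the sink_action helper and the soft_delete_column parameter are hypothetical.

```python
def sink_action(value, soft_delete_column=None):
    """Decide what a sink should do with one record value.

    Returns "delete" or "upsert".
    """
    if value is None:
        return "delete"                  # Kafka tombstone: null value for the key
    if value["op"] == "d":
        return "delete"                  # hard delete captured from the log
    after = value["after"]
    if soft_delete_column and after.get(soft_delete_column) is not None:
        return "delete"                  # soft delete arrives as an update
    return "upsert"

actions = [
    sink_action({"op": "u", "after": {"id": 1, "deleted_at": None}}, "deleted_at"),
    sink_action({"op": "u", "after": {"id": 1, "deleted_at": "2024-05-01"}}, "deleted_at"),
    sink_action({"op": "d", "before": {"id": 1}, "after": None}),
    sink_action(None),
]
```

Treating the tombstone as a second delete is safe only if the sink's delete is idempotent; some consumers simply skip null values instead.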

Schema evolution and DDL

Adding a column is usually safe: new fields appear in event payloads and consumers ignore unknown keys. Renaming columns, changing types, or splitting tables breaks downstream parsers. Debezium emits schema-change events when configured; Avro with Schema Registry enforces compatibility rules (backward, forward, full). Plan migrations as expand-contract: add new column, dual-write or backfill, switch consumers, drop old column — never rename in place on a live CDC stream without a coordinated cutover.
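During the expand phase, a consumer can accept both the old and the new column so that either producer state works. The sketch below assumes a hypothetical rename from username to display_name and a hypothetical to_search_doc transform.

```python
def to_search_doc(after: dict) -> dict:
    """Build a search document that tolerates both column generations.

    Prefers the new column, falls back to the old one, and ignores
    any keys it does not know about.
    """
    name = after.get("display_name")
    if name is None:
        name = after.get("username")     # old column, dropped in the contract step
    return {"id": after["id"], "display_name": name}

old_shape = to_search_doc({"id": 1, "username": "sam"})
both_columns = to_search_doc({"id": 1, "username": "sam", "display_name": "Sam R."})
new_shape = to_search_doc({"id": 1, "display_name": "Sam R.", "locale": "en"})
```

Once every event carries display_name and the backfill is complete, the fallback branch can be removed and the old column dropped.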

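Large DDL operations, such as rewriting a large table or partition, can produce bursts of WAL that leave the connector far behind. Monitor slot lag. In Postgres a replication slot retains WAL until the connector confirms it, so the usual risk is a full disk; if retention is capped (max_slot_wal_keep_size in Postgres 13 and later, or binlog expiry in MySQL) and the needed segments are removed before catch-up, the connector's position is lost and you must rebuild from a fresh snapshot.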

Failure modes and operations

  • Replication slot bloat — A stopped connector leaves its slot in place, and Postgres retains WAL for that slot until disk fills unless retention is capped. Alert on slot lag, measured as retained WAL per slot from the pg_replication_slots view.
  • Initial snapshot load — First connect reads the full table (consistent snapshot) before tailing the log. Schedule during low traffic; throttle parallel table scans.
  • High-volume tables — A single hot table can overwhelm Kafka partitions. Increase partition count, filter columns at the connector, or route to a dedicated topic.
  • Secrets and PII — CDC streams contain raw row data. Mask columns in transforms, encrypt topics at rest, and restrict consumer ACLs.
  • Failover events — Primary database promotion may reset log positions. Connectors need reconfiguration or automatic leader detection to avoid gaps or duplicates across failover.
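For the slot bloat item above, a lag check might look like the following sketch. It assumes the rows come from the query shown (run with any Postgres client) and that LSNs arrive in their text form; the threshold, the sample rows, and the helper names are made up for illustration. Postgres can also compute the difference server-side with pg_wal_lsn_diff.

```python
SLOT_LAG_SQL = """
SELECT slot_name, active, restart_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""

def lsn_to_int(lsn: str) -> int:
    """Convert a text LSN such as '0/16B3748' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL the server must keep on disk for a slot, in bytes."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

def slots_over_threshold(rows, threshold_bytes):
    """Names of slots retaining more WAL than the alert threshold."""
    return [
        slot_name
        for slot_name, _active, restart_lsn, current_lsn in rows
        if retained_bytes(current_lsn, restart_lsn) > threshold_bytes
    ]

# Made-up sample rows in the shape returned by SLOT_LAG_SQL.
sample_rows = [
    ("debezium_app", False, "0/3000000", "2/0"),    # stopped connector, far behind
    ("reporting", True, "1/FFFFFF00", "2/0"),       # healthy, 256 bytes behind
]
alerts = slots_over_threshold(sample_rows, threshold_bytes=1 << 30)  # 1 GiB
```

An inactive slot with growing retained WAL is the pattern to page on, since nothing will release that WAL until the connector returns or the slot is dropped.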

Production checklist

  1. Confirm log-based CDC is enabled on the source (Postgres wal_level=logical, MySQL binlog_format=ROW).
  2. Inventory every writer to the captured tables — CDC catches migrations and admin scripts the app team forgot about.
  3. Size disk and WAL or binlog retention for worst-case connector downtime (hours, not minutes).
  4. Define event schema versioning and compatibility rules before the first consumer ships.
  5. Partition Kafka topics by primary key when per-entity ordering matters.
  6. Implement idempotent consumers — assume duplicates on restart and retry.
  7. Handle DELETE events explicitly in every downstream sink.
  8. Monitor connector lag, slot WAL retention, and consumer offset drift.
  9. Run an initial snapshot in staging with production-scale row counts to estimate duration.
  10. Document failover runbook: who resets offsets, when to rebuild snapshot vs resume tailing.

Key takeaways

  • CDC streams row-level changes from the database transaction log — inserts, updates, and deletes — without polling or application hooks.
  • Log-based CDC (Debezium, DMS, Maxwell) is the production default: low OLTP overhead, captures all writers, near-real-time latency.
  • Use cases span search, caches, warehouses, and event backbones — anywhere derived data must track the OLTP source of truth.
  • CDC complements the transactional outbox — raw table mirroring vs curated domain events; many systems need both.
  • Operate for at-least-once delivery — idempotent consumers, tombstone handling, slot lag alerts, and careful schema migrations.
