Saga pattern explained

A user places an order. In a monolith, inventory, payment, and shipping are updated inside one database transaction. Split those concerns into three microservices and you lose the single ACID boundary. The saga pattern replaces one giant transaction with a sequence of local commits, each followed by either the next step or, when something fails, compensating actions for the steps already committed. There is no global lock holding three databases hostage while a payment gateway times out. Instead you accept brief eventual consistency and design explicit undo logic. This guide covers choreography versus orchestration, compensating transactions, the transactional outbox, idempotency requirements, workflow engines, stuck-saga recovery, and a checklist teams use before shipping money paths to production.

Why two-phase commit rarely survives contact with microservices

Two-phase commit (2PC) coordinates participants through a prepare phase and a commit phase. Every participant holds its locks until the coordinator says commit. That works between a handful of tightly coupled resource managers on a reliable network. Across independently deployed HTTP services it couples availability: if inventory is slow, payment waits; if the coordinator crashes between phases, prepared participants block until the coordinator recovers or an operator intervenes.

Product teams shipping on weekly deploy cadences usually reject 2PC for cross-service flows. Services need autonomy: separate databases, separate failure domains, separate scaling curves. Sagas make the trade explicit: no atomicity across service boundaries, but forward progress without blocking the whole platform on one slow peer. Each step commits locally; failure triggers semantic undo instead of a shared rollback log. The pattern is decades old (it comes from the 1987 paper "Sagas" by Hector Garcia-Molina and Kenneth Salem) but became essential once event-driven microservices replaced monoliths as the default for new systems.

Choreography vs orchestration

A saga decomposes a business process into transactions — local ACID units — linked by events or commands. Two coordination styles dominate production systems:

Choreography

Services publish domain events; peers subscribe and react without a central brain. Order service emits OrderPlaced; payment listens and emits PaymentCaptured; shipping listens next. No single coordinator owns state. Choreography fits teams already invested in message queues and loose coupling. At three services it is elegant. At fifteen, debugging "who should have reacted to PaymentFailed?" becomes an archaeology exercise through distributed logs.
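
The event chain above can be sketched with an in-memory bus. This is a minimal illustration, not a broker client: EventBus, wire_services, and run_flow are hypothetical names, and ShipmentCreated and OrderCancelled are assumed event names that do not appear in the flow described above. Handlers run synchronously here, whereas a real broker delivers asynchronously and at least once.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-memory stand-in for a real broker (Kafka, SQS, ...)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []  # every event type published, in order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in self._subscribers[event_type]:
            handler(payload)


def wire_services(bus: EventBus, card_ok: bool) -> None:
    # Payment service: reacts to OrderPlaced and knows nothing about shipping.
    def on_order_placed(evt: dict) -> None:
        bus.publish("PaymentCaptured" if card_ok else "PaymentFailed", evt)

    # Shipping service: reacts to PaymentCaptured.
    def on_payment_captured(evt: dict) -> None:
        bus.publish("ShipmentCreated", evt)

    # Order service: reacts to PaymentFailed by cancelling its own order.
    def on_payment_failed(evt: dict) -> None:
        bus.publish("OrderCancelled", evt)

    bus.subscribe("OrderPlaced", on_order_placed)
    bus.subscribe("PaymentCaptured", on_payment_captured)
    bus.subscribe("PaymentFailed", on_payment_failed)


def run_flow(card_ok: bool) -> list[str]:
    bus = EventBus()
    wire_services(bus, card_ok)
    bus.publish("OrderPlaced", {"order_id": "o-1"})
    return bus.log
```

Note that no function in the sketch describes the whole flow; the sequence exists only as the sum of the subscriptions, which is exactly what makes a fifteen-service version hard to debug.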

Orchestration

A dedicated saga coordinator — a state machine table, or a workflow engine like Temporal, Camunda, or AWS Step Functions — issues commands in order: reserve inventory, charge card, create shipment. The coordinator stores saga instance state, retries stalled steps with backoff, and runs compensations in reverse order on failure. Orchestration trades coupling for visibility: auditors see a state diagram; on-call engineers query one table for "where is order 48291 stuck?" The orchestrator itself must be scaled, backed up, and protected like any other tier-1 service.
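
A hand-rolled coordinator along these lines might look like the sketch below. It is an illustration under assumptions, not the API of any workflow engine: Step, SagaInstance, call_with_retry, and run_saga are hypothetical names, and state lives in memory where a real coordinator would persist it after every step.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]        # forward command
    compensation: Callable[[dict], None]  # semantic undo for this step


@dataclass
class SagaInstance:
    saga_id: str
    context: dict
    status: str = "running"
    completed: list[str] = field(default_factory=list)


def call_with_retry(fn, ctx, attempts: int = 3, base_delay: float = 0.0) -> None:
    """Call fn(ctx), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            fn(ctx)
            return
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def run_saga(saga: SagaInstance, steps: list[Step]) -> SagaInstance:
    by_name = {s.name: s for s in steps}
    try:
        for step in steps:
            call_with_retry(step.action, saga.context)
            saga.completed.append(step.name)  # persist here in a real system
        saga.status = "completed"
    except Exception:
        saga.status = "compensating"
        # Undo only what succeeded, newest first. If a compensation exhausts
        # its retries, the exception propagates and the saga stays in
        # "compensating": the stuck-saga case that needs an alert.
        for name in reversed(saga.completed):
            call_with_retry(by_name[name].compensation, saga.context)
        saga.status = "failed"
    return saga
```

With steps for reserve inventory, charge card, and create shipment, a failing shipment step leaves the saga in status failed after refund and release have run, in that order.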

Most mature platforms mix both: orchestration for regulated money paths where every step needs a timeout and audit trail; choreography for notifications, analytics, and search indexing where duplicate delivery is tolerable with idempotent handlers.

Compensating transactions

Database rollback undoes writes inside one engine. Across services, undo is semantic, not automatic. Charging a card can be refunded; reserving warehouse stock can be released; sending email cannot be unsent — you send a correction instead. Every forward step needs a designed compensation classified as:

Reversible — refund payment, cancel reservation, void shipping label. Must be idempotent: calling refund(orderId) twice must not double-refund.
Irreversible — email sent, package handed to carrier, on-chain transaction confirmed. Compensate with a follow-up action rather than pretending the first never happened.
Pivot — alternate fulfillment when primary warehouse is empty instead of pure undo.
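
The idempotency requirement on reversible compensations can be sketched as follows. PaymentService is a hypothetical in-memory class and the "refund:" key format is an assumption; a real service would typically enforce the same rule with a unique constraint on the idempotency key in its own database.

```python
class PaymentService:
    """Hypothetical in-memory payment service used only to show the idea."""

    def __init__(self) -> None:
        self.captured: dict[str, int] = {}  # order_id -> captured cents
        self.refunds: dict[str, int] = {}   # idempotency key -> refunded cents

    def charge(self, order_id: str, cents: int) -> None:
        self.captured[order_id] = cents

    def refund(self, order_id: str) -> int:
        key = f"refund:{order_id}"
        if key in self.refunds:
            # Already refunded: return the earlier result, move no new money.
            return self.refunds[key]
        amount = self.captured.get(order_id, 0)
        self.captured[order_id] = 0
        self.refunds[key] = amount
        return amount
```

A retried compensation then sees the same return value as the first call, so the orchestrator can treat "refund succeeded" and "refund had already succeeded" identically.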

Compensations run in reverse order of the forward steps that succeeded. If inventory was reserved and payment captured but shipping failed, refund the payment first, then release the reservation; there is no shipment to cancel. Saga state must record which steps completed so you do not compensate steps that never ran, and so retries do not re-execute completed forward work without idempotency keys.
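
One way to make compensation resumable is to drive it entirely from the recorded state. In the sketch below, the record layout (completed, compensated, status) and the function name compensate are assumptions for illustration; the record is a plain dict where a real system would use a durable row.

```python
def compensate(record: dict, compensations: dict) -> dict:
    """Run compensations for completed steps, newest first.

    `record` is a hypothetical persisted saga row:
      {"completed": [step names in forward order], "compensated": [...], "status": ...}
    `compensations` maps step name -> callable taking the record.
    """
    record["status"] = "compensating"
    for step in reversed(record["completed"]):
        if step in record["compensated"]:
            continue  # undone on an earlier attempt; do not undo twice
        compensations[step](record)         # may raise; progress so far is kept
        record["compensated"].append(step)  # persist this in a real system
    record["status"] = "failed"
    return record
```

Steps that never completed are never looked up, so their compensations cannot run by accident, and calling compensate again after a crash continues from the first step not yet undone.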

The transactional outbox

The naive approach (update Postgres, then call producer.send()) has two failure modes, depending on ordering. If you commit first, the DB commits but the process can crash before Kafka acks, and downstream never learns. If you send while the transaction is still open, Kafka can ack and the DB then roll back, and consumers act on ghost events. The transactional outbox writes the event into an outbox table in the same database transaction as the business row:

CREATE TABLE outbox (
  id           UUID PRIMARY KEY,
  aggregate_id UUID NOT NULL,
  event_type   TEXT NOT NULL,
  payload      JSONB NOT NULL,
  created_at   TIMESTAMPTZ DEFAULT now(),
  published_at TIMESTAMPTZ
);

A relay process reads unpublished rows, publishes them to the broker, then marks published_at. The relay is either a polling worker or CDC via Debezium; with CDC the connector tracks its own position in the database log, so marking published_at is not required. If a polling relay crashes after publish but before mark, the next poll republishes, so consumers must dedupe by outbox.id or business key. If it crashes before publish, the row is retried. The business row and outbox row appear together or not at all. Sagas depend on reliable handoff between services; the outbox is the plumbing most choreography and orchestration designs share.

Saga state machines and workflow engines

Hand-rolled saga tables work for early products: columns for saga_id, current_step, status (running, compensating, completed, failed), and JSON context. As flows grow, workflow engines reduce boilerplate:

Temporal / Cadence — durable execution with automatic retries, timers, and compensation as first-class activities. Survives worker crashes mid-step.
Camunda / Zeebe — BPMN diagrams auditors can read; good for regulated industries.
AWS Step Functions — managed state machines with visual debugging; pairs naturally with Lambda and SQS.
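
For the hand-rolled option, the saga table and its legal status transitions might be sketched like this, again using sqlite3 as a stand-in database. The ALLOWED map and the function names are assumptions; the four status values and the column names are the ones listed above.

```python
import json
import sqlite3

# Which status changes the coordinator accepts; terminal states allow none.
ALLOWED = {
    "running": {"running", "completed", "compensating"},
    "compensating": {"compensating", "failed"},
    "completed": set(),
    "failed": set(),
}


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE saga (
            saga_id      TEXT PRIMARY KEY,
            current_step TEXT NOT NULL,
            status       TEXT NOT NULL
                CHECK (status IN ('running', 'compensating', 'completed', 'failed')),
            context      TEXT NOT NULL
        )""")


def start(conn: sqlite3.Connection, saga_id: str, first_step: str, context: dict) -> None:
    with conn:
        conn.execute(
            "INSERT INTO saga VALUES (?, ?, 'running', ?)",
            (saga_id, first_step, json.dumps(context)),
        )


def advance(conn: sqlite3.Connection, saga_id: str, next_step: str, next_status: str) -> None:
    """Move a saga to its next step, rejecting transitions out of terminal states."""
    with conn:
        row = conn.execute(
            "SELECT status FROM saga WHERE saga_id = ?", (saga_id,)
        ).fetchone()
        if row is None:
            raise KeyError(saga_id)
        if next_status not in ALLOWED[row[0]]:
            raise ValueError(f"illegal transition {row[0]} -> {next_status}")
        conn.execute(
            "UPDATE saga SET current_step = ?, status = ? WHERE saga_id = ?",
            (next_step, next_status, saga_id),
        )
```

Rejecting transitions out of completed and failed is the piece that tends to get lost in ad hoc implementations, and it is one of the guarantees a workflow engine provides for free.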

Whether custom or engine-driven, every saga instance needs a correlation ID propagated through outbox payloads, HTTP headers, and logs so support can trace one customer order across fifteen services. Without correlation, "order stuck" tickets are impossible to debug.

What users see: pending states and timeouts

Sagas are eventually consistent. Between steps, a user might see "payment processing" while inventory is already reserved. Product UX must reflect saga state, not assume instant global truth:

Pending states — never show "shipped" until the saga reaches a terminal success state.
Read-your-writes — route the user's next GET to the service that owns the saga instance they just mutated.
Timeouts and escalation — if payment exceeds 30 seconds, mark saga failed and compensate automatically; page on-call if compensation itself fails (the dangerous stuck saga).
Dead-letter queues — poison messages after N retries land where operators inspect payload and replay or skip.
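
The dead-letter rule in the last item can be sketched in a few lines. The function name consume, the list used as the dead-letter queue, and the default of three attempts are assumptions; brokers such as SQS or RabbitMQ provide their own redelivery and dead-letter configuration, which this does not model.

```python
def consume(message: dict, handler, dead_letters: list, max_attempts: int = 3) -> bool:
    """Try the handler up to max_attempts times; park the message on exhaustion."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = repr(exc)
    # Keep the payload and the last error so an operator can replay or skip.
    dead_letters.append(
        {"message": message, "error": last_error, "attempts": max_attempts}
    )
    return False
```

Because a message may be retried after a partially successful attempt, the handler passed in still needs to be idempotent.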

Pair saga timeouts with circuit breakers on external dependencies. Retry storms during compensation amplify outages when payment gateways are already degraded.
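
A circuit breaker in front of an external dependency can be sketched as below. This is a deliberately small version with assumed names and thresholds (CircuitBreaker, failure_threshold, reset_after); production libraries add per-endpoint state, metrics, and a stricter half-open policy.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is considered down."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable for tests
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency unavailable; not calling")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the refund call in such a breaker means a compensating saga fails fast while the gateway is down and can be retried later, instead of adding load to a dependency that is already degraded.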

Stuck sagas: detection and manual recovery

Among the worst production incidents is a saga stuck in compensating because the refund succeeded but the inventory release failed, leaving money returned but stock still reserved. Mitigations:

Dashboards counting sagas per state (running, compensating, completed, failed) with alerts when compensating count exceeds baseline.
SLA timers: auto-escalate sagas in non-terminal states beyond N minutes.
Admin replay tools: idempotent re-run of a single compensation step with audit log.
Reconciliation jobs: nightly compare payment ledger vs order status vs inventory reservations; flag orphans.
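
The reconciliation job in the last item reduces to a comparison across three views of the same order. The sketch below assumes each service can export a simple mapping keyed by order ID; the status strings, the function name reconcile, and the finding messages are all hypothetical.

```python
def reconcile(payments: dict, orders: dict, reservations: dict) -> list:
    """Flag orders whose payment, order, and inventory views disagree.

    payments:     order_id -> "captured" | "refunded"
    orders:       order_id -> "completed" | "failed" | other in-flight status
    reservations: order_id -> units still held
    """
    findings = []
    for order_id in sorted(set(payments) | set(orders) | set(reservations)):
        status = orders.get(order_id)
        paid = payments.get(order_id)
        held = reservations.get(order_id, 0)
        if status is None:
            findings.append((order_id, "payment or reservation without an order"))
            continue
        if status == "failed" and paid == "captured":
            findings.append((order_id, "money kept for failed order"))
        if status == "failed" and held > 0:
            findings.append((order_id, "stock still reserved for failed order"))
        if status == "completed" and paid != "captured":
            findings.append((order_id, "completed order without captured payment"))
    return findings
```

In-flight sagas will legitimately disagree for a short time, so a real job would likely skip orders younger than the saga SLA before flagging them.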

Observability is not optional. Trace IDs in outbox payloads, structured logs with saga_id and step, and metrics for compensation failure rate turn multi-hour mysteries into ten-minute fixes.

Common pitfalls

Missing compensations — a forward step ships without a documented undo or an explicit "irreversible" classification.
Compensating out of order — refunding the payment before canceling the shipment, or voiding a label for a shipment that was never created.
Non-idempotent consumers — duplicate OrderPlaced creates two shipments.
Polling outbox without index — WHERE published_at IS NULL over millions of rows forces a full scan on every poll; add a partial index covering only unpublished rows.
Dual-write without outbox — "we'll retry Kafka manually" becomes on-call pager debt.
Saga as distributed monolith — fifteen synchronous HTTP calls in one orchestrator recreates 2PC latency without 2PC guarantees.

Production checklist

Model happy path and every failure branch before writing code.
Store saga state (step, status, correlation ID) in a durable table or workflow engine.
Use outbox or CDC for every "DB then event" path; never dual-write bare.
Make forward and compensation handlers idempotent with business-key dedup.
Choose choreography for loose coupling; orchestration when you need a visible state machine.
Expose pending and failed states to users and support tools.
Alert on sagas stuck in compensating or retry loops beyond SLA.
Run reconciliation jobs to catch drift between services.

Key takeaways

Sagas replace cross-service 2PC with local commits plus compensating steps.
Choreography scales event-native architectures; orchestration fits regulated flows needing visible state.
The transactional outbox makes "update DB and publish event" one logical unit.
Idempotency is mandatory under at-least-once delivery — forward and compensation handlers alike.
Stuck saga alerts and reconciliation jobs prevent silent money and inventory drift.