Explainer · 7 June 2026

How the saga pattern and transactional outbox work

A user clicks Place order. Your monolith used to debit inventory, charge the card, and send a confirmation email inside one database transaction. Split that into three microservices and you lose a single ACID boundary. Sagas replace one big transaction with a sequence of local commits plus compensating steps when something fails. The transactional outbox solves a narrower but equally painful problem: how to update your database and publish an event to Kafka as one atomic unit, so that no event is lost and none is published for a change that never committed. Duplicates are still possible, and consumers deal with them by being idempotent.

Why distributed transactions break down

Two-phase commit (2PC) asks every participant to prepare, then commit together. It gives you atomicity across services — until a coordinator crashes mid-commit, a network partition isolates a participant, or one service's lock is held for seconds while another team deploys. 2PC blocks, does not compose well with HTTP timeouts, and couples availability: if the inventory service is slow, payment waits.

Most product teams eventually accept a harder truth: there is no global transaction across independently deployed services. You trade atomicity for autonomy. Sagas make that trade explicit. Each step commits locally; failure triggers undo logic instead of rolling back a shared log. The outbox pattern handles the plumbing — reliable handoff from database state to asynchronous consumers — so saga steps can react to events without polling every service.

Saga pattern: choreography vs orchestration

A saga is a long-running business process decomposed into local transactions (each one an ACID unit inside a single service), each followed by the next step or, on failure, by a compensating action. Two coordination styles dominate:

Choreography — services publish domain events; peers subscribe and react. Order service emits OrderPlaced; payment listens and emits PaymentCaptured; shipping listens next. No central brain. Simple at three services; debugging "who should have reacted?" gets painful at fifteen.
Orchestration — a dedicated saga coordinator (state machine or workflow engine like Temporal, Camunda, or a custom table) issues commands: "reserve inventory," then "charge card," then "create shipment." The coordinator stores saga state, retries stalled steps, and runs compensations in reverse order on failure. More visibility; the orchestrator is a single point you must scale and harden.

Choreography fits systems that are already built around domain events and a shared broker. Orchestration fits regulated flows where auditors want a visible state diagram and explicit timeouts per step. Many production systems mix both: orchestration for the money path, choreography for notifications and analytics.
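
To make the choreography style concrete, here is a minimal in-memory sketch. The EventBus class and the ShipmentCreated event name are assumptions made for illustration; a real system would put a broker such as Kafka between the services, and each handler would live in its own process.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []  # every event type published, in order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in self._subscribers[event_type]:
            handler(payload)


bus = EventBus()


def on_order_placed(event: dict) -> None:
    # Payment service: reacts to OrderPlaced, then announces its own result.
    bus.publish("PaymentCaptured", {"order_id": event["order_id"]})


def on_payment_captured(event: dict) -> None:
    # Shipping service: reacts to PaymentCaptured (event name below is assumed).
    bus.publish("ShipmentCreated", {"order_id": event["order_id"]})


bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("PaymentCaptured", on_payment_captured)

# The order service only publishes; it never calls payment or shipping directly.
bus.publish("OrderPlaced", {"order_id": "o-1"})
print(bus.log)  # ['OrderPlaced', 'PaymentCaptured', 'ShipmentCreated']
```

No component in this sketch knows the whole flow, which is both the appeal of choreography and the reason it gets hard to debug as subscribers multiply.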

Compensating transactions

Database rollback undoes writes inside one engine. Across services, undo is semantic, not automatic. Charging a card can be refunded; reserving warehouse stock can be released; sending email cannot be unsent — you send a correction instead. Each forward step needs a designed compensation:

Reversible — refund payment, cancel reservation, void label. Must be idempotent: calling refund(orderId) twice should not double-refund.
Irreversible — email sent, package handed to carrier, blockchain tx confirmed. Compensate with a follow-up action (apology email, return label) rather than pretending the first never happened.
Pivot — alternate fulfillment path when primary warehouse is empty instead of pure undo.
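
The idempotency requirement on reversible compensations can be sketched as follows. This is one possible approach, not the only one: it assumes a hypothetical refunds table whose primary key is the order ID, and it uses SQLite in place of a production database so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the PRIMARY KEY on order_id is what makes refund() idempotent.
conn.execute(
    "CREATE TABLE refunds (order_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)


def refund(order_id: str, amount_cents: int) -> bool:
    """Record a refund at most once per order.

    Returns True only when this call performed the refund.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT OR IGNORE INTO refunds (order_id, amount_cents) VALUES (?, ?)",
            (order_id, amount_cents),
        )
        if cur.rowcount == 0:
            return False  # a refund for this order already exists: do nothing
        # The call to the payment provider would go here (assumed, not shown).
        # Passing the order ID as the provider's idempotency key covers a crash
        # between the provider call and the commit.
        return True
```

Calling refund("o-1", 500) twice returns True and then False, and only one row is ever stored for the order.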

Compensations run in reverse order of the forward steps that succeeded. If inventory was reserved and payment captured but shipping failed, refund the payment first, then release the reservation; there is no shipment to undo. Saga state must record which steps completed so that you do not compensate steps that never ran, and forward steps need idempotency keys so that a retry does not re-execute work that already completed.
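
A small orchestrator illustrates the reverse-order rule. The Step and Saga classes are hypothetical and keep their state in memory; a real coordinator would persist the completed list in a durable table or a workflow engine.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]


@dataclass
class Saga:
    steps: list[Step]
    completed: list[str] = field(default_factory=list)  # steps that committed
    status: str = "running"

    def run(self) -> str:
        for step in self.steps:
            try:
                step.action()
            except Exception:
                self._compensate()
                return self.status
            self.completed.append(step.name)
        self.status = "completed"
        return self.status

    def _compensate(self) -> None:
        # Undo only the steps that completed, newest first.
        self.status = "compensating"
        by_name = {step.name: step for step in self.steps}
        for name in reversed(self.completed):
            by_name[name].compensation()
        self.status = "failed"


trail: list[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier API unavailable")


saga = Saga(steps=[
    Step("reserve_inventory", lambda: trail.append("reserve"), lambda: trail.append("release")),
    Step("charge_card", lambda: trail.append("charge"), lambda: trail.append("refund")),
    Step("create_shipment", fail_shipping, lambda: trail.append("void_label")),
])
result = saga.run()
print(result, trail)  # failed ['reserve', 'charge', 'refund', 'release']
```

The shipment compensation never runs because create_shipment never completed, and the refund happens before the reservation is released.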

The transactional outbox pattern

The naive approach — update Postgres, then call producer.send() — has two failure modes:

DB commits, process crashes before Kafka ack → downstream never learns the order was placed.
Kafka acks, DB rolls back → consumers act on an event for data that does not exist.

The transactional outbox fixes this by writing the event into an outbox table in the same database transaction as the business row. Schema sketch:

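CREATE TABLE outbox (
  id           UUID PRIMARY KEY,                    -- doubles as the consumer-side dedupe key
  aggregate_id UUID NOT NULL,                       -- business entity the event belongs to
  event_type   TEXT NOT NULL,                       -- e.g. OrderPlaced
  payload      JSONB NOT NULL,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ                          -- NULL until the relay has published the row
);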
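
The write side of the pattern can be sketched in a few lines. This example uses SQLite instead of Postgres so it runs standalone (TEXT columns replace UUID and JSONB), and the orders table and place_order function are assumptions made for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          TEXT PRIMARY KEY,
        total_cents INTEGER NOT NULL CHECK (total_cents > 0)
    );
    CREATE TABLE outbox (
        id           TEXT PRIMARY KEY,
        aggregate_id TEXT NOT NULL,
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,
        created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        published_at TEXT
    );
""")


def place_order(order_id: str, total_cents: int) -> None:
    """Write the OrderPlaced event and the order row in one local transaction."""
    event = {"order_id": order_id, "total_cents": total_cents}
    with conn:  # one transaction: both rows commit, or neither does
        # Statement order inside the transaction does not matter; the event is
        # written first here so the failing case below exercises the rollback.
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced", json.dumps(event)),
        )
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )


place_order("o-1", 2500)
try:
    place_order("o-2", -1)  # violates the CHECK constraint
except sqlite3.IntegrityError:
    pass  # the whole transaction rolled back, including the outbox row

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
event_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(order_count, event_count)  # 1 1
```

The rejected order leaves no orphan event behind, which is the guarantee the naive "commit, then send" approach cannot give.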

A separate relay process (polling worker or log-based CDC like Debezium) reads unpublished rows, publishes to the message broker, then marks published_at. If the relay crashes after publish but before mark, the next poll may republish — so consumers must be idempotent (dedupe by outbox.id or business key). If the relay crashes before publish, the row stays unpublished and will retry. The business row and outbox row appear together or not at all.
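
A polling relay, and the duplicate it can produce, might look like the sketch below. The published list stands in for the broker, and the crash_before_mark flag exists only to simulate a crash; both are assumptions for illustration, as is the use of SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE outbox (
        id           TEXT PRIMARY KEY,
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,
        created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        published_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
    [("e-1", "OrderPlaced", "{}"), ("e-2", "OrderPlaced", "{}")],
)
conn.commit()

published: list[str] = []  # stand-in for the broker; a real relay calls a producer client


def relay_once(batch_size: int = 100, crash_before_mark: bool = False) -> int:
    """Publish one batch of unpublished rows; return how many rows were read."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at, id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, _event_type, _payload in rows:
        published.append(event_id)  # 1. publish to the broker
        if crash_before_mark:
            raise RuntimeError("simulated crash after publish, before mark")
        with conn:  # 2. mark the row as published
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
    return len(rows)


try:
    relay_once(crash_before_mark=True)
except RuntimeError:
    pass
relay_once()
print(published)  # ['e-1', 'e-1', 'e-2']
```

The event e-1 reaches the broker twice because the first run died between publish and mark. That is at-least-once delivery in action, and it is why the consumer has to deduplicate.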

Variants include the inbox pattern on the consumer side (record processed message IDs before handling) and change-data-capture (CDC) streams that treat the database WAL as the source of truth instead of a polling loop. All share the same goal: make "state change" and "notify the world" one logical unit.
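
On the consumer side, the inbox pattern can be sketched like this. The inbox and shipments tables and the handler name are hypothetical, and SQLite again stands in for the service's real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbox (message_id TEXT PRIMARY KEY);  -- IDs of handled messages
    CREATE TABLE shipments (order_id TEXT NOT NULL);
""")


def handle_payment_captured(message_id: str, order_id: str) -> bool:
    """Create a shipment unless this message was already handled.

    Returns True when the side effect ran, False for a duplicate delivery.
    """
    with conn:  # the inbox row and the side effect commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (message_id) VALUES (?)", (message_id,)
        )
        if cur.rowcount == 0:
            return False  # seen before: skip the side effect
        conn.execute("INSERT INTO shipments (order_id) VALUES (?)", (order_id,))
        return True
```

Because the inbox insert and the shipment insert share one transaction, a crash in between leaves neither behind, and the redelivered message is processed cleanly on the next attempt.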

At-least-once delivery and idempotent handlers

Most message brokers give you at-least-once delivery once producers and consumers retry; exactly-once end-to-end across database and bus needs specialized infrastructure. Saga and outbox designs embrace that:

Forward handlers check a processed-events table or natural key uniqueness before side effects.
Compensation handlers check whether the forward step actually succeeded before undoing.
Outbox relays use a monotonic cursor or a published_at IS NULL query with row-level locking (in Postgres, SELECT ... FOR UPDATE SKIP LOCKED) so that concurrent workers do not publish the same row twice.
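
The first two rules above can be sketched with natural-key uniqueness and a guarded update. The reservations table and both function names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reservations (
        order_id TEXT PRIMARY KEY,  -- natural key: at most one reservation per order
        sku      TEXT NOT NULL,
        status   TEXT NOT NULL CHECK (status IN ('reserved', 'released'))
    )
""")


def reserve_inventory(order_id: str, sku: str) -> bool:
    """Forward handler: a duplicate delivery hits the primary key and is ignored."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO reservations (order_id, sku, status) "
            "VALUES (?, ?, 'reserved')",
            (order_id, sku),
        )
        return cur.rowcount == 1


def release_inventory(order_id: str) -> bool:
    """Compensation handler: acts only if the forward step actually succeeded."""
    with conn:
        cur = conn.execute(
            "UPDATE reservations SET status = 'released' "
            "WHERE order_id = ? AND status = 'reserved'",
            (order_id,),
        )
        return cur.rowcount == 1
```

Releasing an order that was never reserved, or releasing the same order twice, returns False and changes nothing.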

Pair outbox IDs with idempotency keys on HTTP callbacks for hybrid flows (saga step calls external API, stores result). Payment processors and blockchains often provide natural idempotency via client-supplied keys or unique on-chain signatures — the same discipline applies off-chain.

Sagas vs 2PC vs outbox-only

These tools solve overlapping problems at different layers:

2PC — strong atomicity across participants willing to block; rare in greenfield microservices; still appears inside databases (XA) and some legacy ERP bridges.
Saga — business-level consistency via forward steps + compensations; accepts temporary inconsistency visible to users (order "pending" while payment retries).
Transactional outbox — reliable event emission from one service; does not coordinate multi-service logic by itself but is the transport layer most sagas depend on.

A typical checkout saga: the order service writes the order and an outbox event in one transaction; the relay publishes OrderPlaced; the payment service consumes it, charges the card, and writes PaymentCaptured to its own outbox; shipping subscribes. If the charge fails, the payment service (or the orchestrator, if there is one) emits PaymentFailed; the order service listens and marks the order cancelled. No 2PC, no cross-DB locks.

Consistency models users actually see

Sagas are eventually consistent. Between steps, a user might see "payment processing" while inventory is already reserved. Product and support tooling must reflect saga state, not assume instant global truth. Common mitigations:

Pending states in UI — never show "shipped" until the saga reaches a terminal success state.
Read-your-writes — route the user's next GET to the service that owns the saga instance they just mutated.
Timeouts and escalation — if payment step exceeds 30s, mark saga failed and compensate automatically; alert if compensation itself fails (the dangerous "stuck saga").
Dead-letter queues — poison messages after N retries land where operators can inspect payload and replay or skip.
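
The dead-letter mitigation can be sketched as a bounded retry loop. MAX_ATTEMPTS, the consume function, and the in-memory dead_letters list are assumptions for illustration; a real consumer would add backoff between attempts and write to a durable dead-letter topic or table.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # the "N" from the text; the value is an assumption

dead_letters: list[dict] = []  # stand-in for a durable dead-letter queue


def consume(message: dict, handler: Callable[[dict], None],
            max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Try the handler up to max_attempts times, then park the message."""
    last_error = ""
    for _attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = repr(exc)
    dead_letters.append(
        {"message": message, "attempts": max_attempts, "error": last_error}
    )
    return False


attempts = {"count": 0}


def flaky(_message: dict) -> None:
    # Fails twice, then succeeds: a transient fault that retries can absorb.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("downstream timeout")


def poison(_message: dict) -> None:
    # Fails every time: retrying will never help.
    raise ValueError("malformed payload")


ok_flaky = consume({"id": "m-1"}, flaky)
ok_poison = consume({"id": "m-2"}, poison)
print(ok_flaky, ok_poison, len(dead_letters))  # True False 1
```

The dead-letter entry keeps the payload and the last error so an operator can decide whether to replay or skip it.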

Observability matters: trace IDs propagated through outbox payloads, saga instance IDs in logs, and dashboards for count of sagas per state (running, compensating, completed, failed). Without them, "order stuck" tickets are impossible to debug.
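
A per-state count is cheap to produce when saga state lives in a table. The sketch below assumes a hypothetical sagas table with the four states named above and uses SQLite for self-containment.

```python
import sqlite3

STATES = ("running", "compensating", "completed", "failed")

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sagas (
        id       TEXT PRIMARY KEY,
        state    TEXT NOT NULL
                 CHECK (state IN ('running', 'compensating', 'completed', 'failed')),
        trace_id TEXT NOT NULL  -- propagated into logs and outbox payloads
    )
""")
conn.executemany(
    "INSERT INTO sagas (id, state, trace_id) VALUES (?, ?, ?)",
    [
        ("s-1", "running", "t-1"),
        ("s-2", "running", "t-2"),
        ("s-3", "compensating", "t-3"),
        ("s-4", "completed", "t-4"),
        ("s-5", "completed", "t-5"),
        ("s-6", "completed", "t-6"),
    ],
)
conn.commit()


def sagas_per_state() -> dict[str, int]:
    """Return a count for every state, including states with no sagas."""
    counts = {state: 0 for state in STATES}
    rows = conn.execute("SELECT state, COUNT(*) FROM sagas GROUP BY state")
    for state, count in rows:
        counts[state] = count
    return counts


print(sagas_per_state())
# {'running': 2, 'compensating': 1, 'completed': 3, 'failed': 0}
```

Reporting zero for empty states keeps dashboard series continuous, so a jump in the compensating count stands out.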

Common pitfalls

Missing compensations — a forward step ships without a documented undo or an explicit "irreversible" classification.
Compensating out of order — issuing the refund before the shipment is cancelled, or voiding a label that was never created.
Non-idempotent consumers — duplicate OrderPlaced creates two shipments.
Polling outbox without index — WHERE published_at IS NULL scans millions of already-published rows on every poll; a partial index on the unpublished rows and a bounded batch size keep the relay cheap.
Dual-write without outbox — "we'll retry Kafka manually" becomes on-call pager debt.
Saga as distributed monolith — fifteen synchronous HTTP calls in one orchestrator recreates 2PC latency without 2PC guarantees.
Ignoring circuit breakers — retry storms during compensation amplify outages.
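
For the last pitfall, a circuit breaker around compensation calls might look like the sketch below. The class, its thresholds, and the injectable clock are assumptions for illustration; production code would more likely use an existing resilience library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # success closes the circuit
        return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
calls: list[str] = []


def failing_refund() -> None:
    calls.append("refund")
    raise ConnectionError("payment provider down")


outcomes: list[str] = []
for _ in range(4):
    try:
        breaker.call(failing_refund)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

print(outcomes, len(calls))  # ['failed', 'failed', 'skipped', 'skipped'] 2

now[0] = 11.0  # the reset window has passed
recovered = breaker.call(lambda: "ok")
```

After two failures the breaker stops calling the provider, so a compensation retry loop cannot pile more load onto a service that is already down.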

Practical checklist

Model the happy path and every failure branch before writing code.
Store saga state (step, status, correlation ID) in a durable table or workflow engine.
Use outbox (or CDC) for every "DB then event" path; never dual-write bare.
Make forward and compensation handlers idempotent with business-key dedup.
Choose choreography for loose coupling; orchestration when you need a visible state machine.
Expose pending and failed states to users and support tools.
Alert on sagas stuck in compensating or retry loops beyond SLA.