Guide

Exponential backoff and retry patterns explained

A payment API blips for eight seconds during a database failover. Your checkout service retries immediately — and so do four hundred other instances, each firing three attempts in the same second. The recovering database never gets a breath; the blip becomes a twenty-minute outage. That is a retry storm: well-intentioned fault tolerance that amplifies failure instead of absorbing it. Retries are essential for transient errors — network timeouts, brief 503 responses, leader election gaps — but only when spaced intelligently. Exponential backoff increases delay between attempts; jitter randomizes those delays so clients do not synchronize. This guide covers backoff formulas, jitter strategies, retry budgets, which errors deserve another try, idempotency requirements for safe replays, pairing with circuit breakers, and a production checklist for resilient clients and servers.

Why retries exist — and when they hurt

Distributed systems fail in short bursts. TCP retransmits packets; load balancers drain unhealthy nodes; cloud APIs return 503 Service Unavailable while autoscaling catches up. A single request that fails once often succeeds on the second attempt — if the underlying problem was transient.

Retries hurt when:

The error is permanent — validation failures, auth denials, and 404 responses will not heal with delay.
The operation is not idempotent — charging a card twice because the first response timed out is worse than failing once.
Many clients retry in lockstep — synchronized retries recreate the overload that caused the failure.
Retries lack a ceiling — unbounded attempts keep pressure on a recovering dependency indefinitely.

Good retry design answers four questions before the first failure: Is this error retryable? Is the handler safe to run again? How long should we wait? When do we give up and surface failure to a human or a dead letter queue?

Fixed delay vs exponential backoff

The simplest retry loop waits a constant interval (say, one second) between attempts. Fixed delay is easy to reason about but dangerous at scale: every client that failed at time T retries at T + 1s, creating periodic traffic spikes that hit the recovering service like a metronome.

Exponential backoff multiplies the wait after each failure. A common pattern:

delay = min(cap, base * 2^attempt)

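With base = 100ms, cap = 30s, and attempt counted from zero, the first five delays are 100ms → 200ms → 400ms → 800ms → 1.6s. The 30s cap only binds from attempt 9 onward (100ms × 2^9 = 51.2s), so it matters for long-running workers rather than for a five-attempt call. Early retries catch fast recoveries; later attempts back off aggressively when the outage persists.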
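As a minimal sketch of the formula above, the function below computes the un-jittered delay for a zero-based attempt index. The name backoff_delay and the use of seconds as the unit are assumptions made for illustration; the defaults mirror the base = 100ms, cap = 30s example.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Return min(cap, base * 2^attempt) in seconds for a zero-based attempt."""
    return min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    # Print the schedule for the first ten attempts to see where the cap binds.
    for attempt in range(10):
        print(f"attempt {attempt}: {backoff_delay(attempt):.1f}s")
```

This schedule is deterministic, which is exactly why it should not be used on its own when many clients share a dependency.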

Choose base large enough to avoid hammering a service in its first milliseconds of recovery, and cap small enough that total user-visible latency stays within your SLO. Document max attempts per operation class — a read can afford more tries than a financial write.

Jitter: breaking synchronization

Exponential backoff alone still leaves clients aligned if they failed at the same moment: a deploy, a regional blip, or a shared dependency outage puts thousands of clients on the same retry schedule. Jitter adds randomness so retry times spread across a window.

Full jitter (recommended default)

AWS popularized picking a uniform random delay between zero and the calculated backoff:

sleep = random(0, min(cap, base * 2^attempt))

Full jitter minimizes collision probability and is the safest default for client SDKs talking to shared infrastructure.
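A possible Python rendering of the full-jitter formula follows. The function name, the seconds unit, and the optional rng parameter (handy for seeding in tests) are assumptions for this sketch, not part of any particular SDK.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a uniform random delay between 0 and the capped exponential backoff."""
    ceiling = min(cap, base * (2 ** attempt))
    uniform = rng.uniform if rng is not None else random.uniform
    return uniform(0.0, ceiling)
```

Because the lower bound is zero, some clients retry almost immediately; that is intended, since the goal is to spread the whole population across the window rather than to enforce a minimum wait per client.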

Equal jitter

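Take half the calculated delay and add a random amount up to the other half: delay/2 + random(0, delay/2), where delay = min(cap, base * 2^attempt). This keeps the average wait closer to the exponential curve and guarantees a minimum wait of delay/2, while still desynchronizing clients.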

Decorrelated jitter

Each wait depends on the previous sleep, not just the attempt count: sleep = min(cap, random(base, previous_sleep * 3)), with previous_sleep starting at base. This is useful when attempt numbers are unreliable (message visibility timeouts that reset) or when you want faster spread without strict powers of two.
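The equal and decorrelated variants can be sketched the same way. Function names and default values here are illustrative assumptions; the decorrelated version expects the caller to carry the previous sleep between attempts, seeded with base on the first failure.

```python
import random


def equal_jitter_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Half the capped backoff, plus a random amount up to the other half."""
    delay = min(cap, base * (2 ** attempt))
    return delay / 2 + random.uniform(0.0, delay / 2)


def decorrelated_jitter_delay(previous_sleep: float, base: float = 0.1,
                              cap: float = 30.0) -> float:
    """Next sleep drawn between base and three times the previous sleep, capped."""
    return min(cap, random.uniform(base, previous_sleep * 3))


if __name__ == "__main__":
    sleep_s = 0.1  # start from base on the first failure
    for _ in range(5):
        sleep_s = decorrelated_jitter_delay(sleep_s)
        print(f"next sleep: {sleep_s:.3f}s")
```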

Whichever variant you choose, log the attempt number, chosen delay, and error class — without logging secrets or full payloads.

Retry budgets and giving up

A retry budget caps how much retry traffic your system generates, globally or per dependency. Google's SRE book, for example, describes a per-client budget in which a request is retried only while retries stay below roughly 10% of that client's requests, so a failing backend cannot be drowned by its own clients' goodwill.
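One way to picture a ratio-based budget is a small counter object that every outbound call consults before retrying. The class below is a deliberately simplified sketch: the name RetryBudget, the ratio parameter, and the min_retries floor are assumptions, and the counters never decay, whereas a production version would normally use a sliding window or a token bucket.

```python
class RetryBudget:
    """Allow retries only while they stay under a fraction of observed requests."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio              # hypothetical: retries allowed per request
        self.min_retries = min_retries  # floor so low-traffic clients can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first-attempt request toward the budget."""
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and consume budget if one more retry is allowed."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A caller would invoke record_request() once per logical request and check try_spend() before each retry, failing immediately when it returns False.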

Practical limits to set:

Max attempts — typically 3–5 for synchronous HTTP; more for async workers with durable queues.
Total deadline — wall-clock timeout across all attempts (e.g. 10s for an interactive API call).
Per-dependency concurrency — cap simultaneous in-flight retries so one slow service does not exhaust your thread pool.
Retry-After respect — when a server returns Retry-After, honor it instead of your own schedule (within reason).

After the budget is exhausted, fail visibly: return an error to the user, enqueue for later processing, or route to a DLQ. Silent infinite retry loops are how stuck messages and zombie jobs accumulate.
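The limits above can be combined into a single retry loop. The following is a sketch under stated assumptions: call_with_retries and RetriesExhausted are hypothetical names, only ConnectionError and TimeoutError are treated as transient, and the sleep and clock parameters exist so the loop can be tested without real waiting.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised when the attempt limit or the overall deadline is reached."""


def call_with_retries(operation, max_attempts=4, deadline_s=10.0,
                      base=0.1, cap=5.0, sleep=time.sleep, clock=time.monotonic):
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:  # transient, for this sketch
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no attempts left, so do not sleep again
        # Full jitter over the capped exponential delay.
        delay = random.uniform(0.0, min(cap, base * 2 ** attempt))
        if clock() - start + delay > deadline_s:
            break  # sleeping would overrun the total deadline
        sleep(delay)
    # Fail visibly instead of looping forever.
    raise RetriesExhausted(f"gave up after {attempt + 1} attempt(s)") from last_error
```

The caller decides what "fail visibly" means: surface the exception to the user, enqueue the work for later, or route it to a DLQ.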

Retryable vs terminal errors

Not every non-200 status deserves another attempt. A useful rule of thumb for HTTP clients:

Retry — 408 Request Timeout, 429 Too Many Requests (with backoff honoring rate limits), 500, 502, 503, 504, and connection resets where the request may not have reached the server.
Do not retry — 400 bad input, 401/403 auth, 404 not found, 409 conflict (unless your app defines idempotent upsert semantics), most 422 validation errors.

For idempotent GET and HEAD, retries are generally safe. For POST, assume retries are unsafe unless you send an idempotency key the server deduplicates. For PUT and DELETE with stable resource IDs, retries are often safe; for PATCH, it depends on whether the patch is absolute (set a field to a value) or relative (increment a counter).

Message consumers should classify exceptions the same way: network blips and throttling → retry with backoff; schema violations and business-rule rejections → terminal, send to DLQ after N receives.
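The rule of thumb above can be captured in a small predicate. This is a conservative sketch, not a standard: should_retry and its has_idempotency_key flag are hypothetical, PATCH is treated as unsafe by default, and a real client might reasonably retry a 429 on any method because the request was rejected before processing.

```python
RETRYABLE_STATUSES = frozenset({408, 429, 500, 502, 503, 504})
# Methods this guide treats as generally safe to replay.
REPLAY_SAFE_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE"})


def should_retry(method: str, status: int, has_idempotency_key: bool = False) -> bool:
    """Retry only retryable statuses, and only when a replay is safe."""
    if status not in RETRYABLE_STATUSES:
        return False  # terminal: 400, 401, 403, 404, 409, 422, ...
    return method.upper() in REPLAY_SAFE_METHODS or has_idempotency_key
```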

Idempotency: the non-negotiable prerequisite

A timeout is ambiguous: the server may have succeeded and the response was lost, or the server may never have run the handler. Retrying without an idempotency guarantee risks duplicates: double charges, duplicate shipments, two ledger entries for one trade.

Production patterns:

Idempotency keys — client sends Idempotency-Key: uuid; server stores outcome keyed by that ID for 24–72 hours.
Natural idempotency — PUT /users/42 with full representation replaces the same state regardless of repeat count.
Deduplication tables — store processed event IDs for async consumers; skip duplicates on redelivery.
Compare-and-swap — only apply if version or timestamp matches; stale retries no-op safely.

If you cannot make an operation idempotent, do not retry it blindly — use outbox polling, human reconciliation, or a saga with compensating transactions instead.
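To make the idempotency-key pattern concrete, here is a minimal in-memory sketch of the server side. IdempotencyStore and run_once are hypothetical names; a real service would persist outcomes in a shared database, guard against two concurrent requests carrying the same key, and evict expired entries, none of which this example attempts.

```python
import time


class IdempotencyStore:
    """Remember the outcome for each idempotency key for a limited time."""

    def __init__(self, ttl_s: float = 24 * 3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._results = {}  # key -> (stored_at, result)

    def run_once(self, key, handler):
        """Run handler for a new key; replay the stored result for a repeat."""
        now = self._clock()
        hit = self._results.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # duplicate request: return the earlier outcome
        result = handler()
        self._results[key] = (now, result)
        return result
```

With this in place, a client that times out and resends the same key gets the original result back instead of triggering a second charge.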

Where retries live: client, proxy, or broker

Retries can happen at multiple layers; duplicating them multiplies load.

Client SDK retries

Application code or the HTTP client library retries failed calls. Gives fine-grained control per API but risks every service inventing different policies.

Service mesh / API gateway

Envoy, Linkerd, or an API gateway may retry idempotent routes automatically. Centralized policy is powerful but dangerous for non-idempotent POST unless explicitly excluded.

Message broker redelivery

SQS visibility timeout, RabbitMQ nack-with-requeue, and Kafka consumer offset rewind all implement retries asynchronously. Pair broker redelivery with exponential backoff via delayed queues or tiered retry topics — not immediate hot loops.
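For a consumer, the same idea reduces to a decision made on each failed delivery: wait longer before the next redelivery, or stop and dead-letter the message. The helper below is a broker-agnostic sketch; redelivery_action, its parameters, and the returned tuple shape are assumptions, and how the delay is applied (a delayed queue, a tiered retry topic, or a changed visibility timeout) depends on the broker.

```python
def redelivery_action(failed_deliveries: int, max_receives: int = 5,
                      base_s: float = 1.0, cap_s: float = 900.0):
    """Decide what to do after a handler failure.

    Returns ("dead_letter", None) once the receive limit is reached,
    otherwise ("redeliver", delay_in_seconds) with a capped exponential delay.
    """
    if failed_deliveries >= max_receives:
        return ("dead_letter", None)
    return ("redeliver", min(cap_s, base_s * 2 ** failed_deliveries))
```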

Pick one primary retry layer per hop. Attempts multiply across layers: if the client makes three attempts and the gateway makes three attempts for each of them, a fragile backend can see up to nine attempts for one user request.
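The multiplication is easy to check when auditing a call path. Assuming each layer independently makes up to a fixed number of attempts per incoming call, the worst case at the backend is the product of those numbers; worst_case_attempts is a hypothetical helper name.

```python
from math import prod


def worst_case_attempts(attempts_per_layer):
    """Upper bound on backend attempts when every layer retries independently."""
    return prod(attempts_per_layer)


if __name__ == "__main__":
    # Client and gateway with three attempts each.
    print(worst_case_attempts([3, 3]))
```

Adding a third layer with five attempts raises the bound from 9 to 45, which is why disabling duplicate layers matters more than tuning any single one.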

Pairing backoff with circuit breakers and rate limits

Retries and circuit breakers address different phases of the same outage. Backoff helps during brief transients; the circuit opens when the failure rate shows the dependency is down, failing fast instead of wasting slots on doomed attempts.

Typical combination:

First failure → exponential backoff retry (small number of tries).
Sustained failures → circuit trips open; calls return immediately or use a cached fallback.
After a cool-down → half-open probe with a single attempt; success closes the circuit.
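The three steps above can be sketched as a small state machine. This is a simplified, single-threaded illustration: CircuitBreaker and CircuitOpenError are hypothetical names, the breaker trips on consecutive failures rather than a measured failure rate, and the half-open state is implicit (the first call after the cool-down acts as the probe).

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast while the circuit is open")
            # Cool-down elapsed: let this call through as the half-open probe.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

A retry loop wrapped around breaker.call should treat CircuitOpenError as terminal, so it stops scheduling backoff while the breaker is open.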

Edge rate limiting protects shared APIs from abusive retry volume. Return 429 with Retry-After so well-behaved clients back off instead of tight-looping.

HTTP semantics: Retry-After and idempotent methods

RFC 9110 defines Retry-After as either a delay in seconds or an HTTP-date when the client should try again. Servers under load should set it on 503 and 429 responses so clients do not guess.

Clients should parse both forms, clamp unreasonable values, and add jitter even when honoring Retry-After — thousands of clients receiving Retry-After: 5 will still collide at second five without noise.
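A client-side parser for both header forms might look like the sketch below. The function name, the 120-second clamp, and the 20% jitter fraction are assumptions chosen for illustration. Jitter is added on top of the server's value so the client never retries earlier than it was asked to, which means the result can exceed the clamp by up to the jitter fraction.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, max_wait_s=120.0, jitter_fraction=0.2):
    """Turn a Retry-After header value into a jittered wait in seconds.

    Returns None when the value is neither an integer nor a parseable date.
    """
    try:
        wait = float(int(value))  # delay-seconds form, e.g. "5"
    except ValueError:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        wait = (when - now).total_seconds()
    wait = min(max(wait, 0.0), max_wait_s)  # clamp negative or unreasonable values
    return wait + random.uniform(0.0, wait * jitter_fraction)
```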

For long-running operations, prefer 202 Accepted with a status poll URL or webhook callback over holding a connection open through multiple internal retries.

Common failure modes

Retrying non-idempotent POST on timeout — classic double-charge bug; always use idempotency keys for payments and orders.
No jitter on shared outage — recovery window gets hammered by synchronized wave two.
Retrying 429 as fast as possible — ignores rate-limit intent; honor Retry-After and reduce concurrency.
Nested retries across layers — multiplicative attempt count; document and disable duplicate layers.
Retrying through an open circuit — wastes resources; check breaker state before scheduling backoff.
Poison message infinite requeue — terminal errors must land in DLQ after max receives, not loop forever.

Production checklist

Classify every outbound call and consumer handler as retryable or terminal; document the decision.
Require idempotency keys (or natural idempotency) before enabling retries on mutating operations.
Implement exponential backoff with full jitter as the default client policy.
Set max attempts, per-call deadline, and optional per-dependency retry budget.
Honor Retry-After on 429 and 503; add jitter on top.
Pair retries with circuit breakers — stop retrying when the breaker is open.
Ensure only one layer retries per hop (client OR gateway OR broker, not all three).
Emit metrics: retry count, backoff delay histogram, exhausted-budget rate, duplicate-detection hits.
Route exhausted async retries to a DLQ with alert and redrive runbook.
Load-test recovery: simulate dependency outage and verify clients desynchronize and recover without retry storm.
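Before running a real load test, the desynchronization claim in the last checklist item can be sanity-checked with a toy simulation. The sketch below assumes every client failed at the same instant and counts how many retries land in the busiest 100ms bucket; peak_arrivals and all of its parameters are hypothetical, and the model ignores network latency and server behavior entirely.

```python
import random
from collections import Counter


def peak_arrivals(num_clients, attempt, jitter, base=0.1, cap=30.0,
                  bucket_s=0.1, seed=0):
    """Return the largest number of retries arriving within one time bucket."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    ceiling = min(cap, base * 2 ** attempt)
    buckets = Counter()
    for _ in range(num_clients):
        # Full jitter spreads clients over [0, ceiling]; without it all wait `ceiling`.
        delay = rng.uniform(0.0, ceiling) if jitter else ceiling
        buckets[int(delay / bucket_s)] += 1
    return max(buckets.values())


if __name__ == "__main__":
    print("no jitter:  ", peak_arrivals(1000, attempt=4, jitter=False))
    print("full jitter:", peak_arrivals(1000, attempt=4, jitter=True))
```

Without jitter, all simulated clients land in the same bucket; with full jitter, the peak drops to a small fraction of the population. A real test should confirm the same shape against an actual dependency outage.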

Key takeaways

Retries absorb transients; exponential backoff prevents them from becoming storms.
Jitter is not optional at scale — without randomness, backoff still synchronizes clients.
Idempotency comes before retry policy — ambiguous timeouts on writes need deduplication, not blind repeats.
Classify errors — retry timeouts and 5xx; fail fast on 4xx validation and auth.
Circuit breakers and rate limits complement backoff — know when to stop retrying entirely.