Explainer · 7 June 2026

How circuit breakers and resilience patterns work

Your checkout service calls a fraud-scoring API on every order. The scorer degrades — p99 latency climbs from 40 ms to eight seconds. Threads pile up waiting. Connection pools exhaust. The checkout service itself starts timing out, which triggers retries, which multiplies load on the already-sick scorer. Within minutes the failure has climbed the stack and customers see 503s on pages that never touched fraud scoring at all. A circuit breaker is the electrical metaphor for breaking that feedback loop: when error rates or latency cross a threshold, stop calling the dependency immediately and return a controlled fallback instead of joining the pile-up.

The three states: closed, open, half-open

Think of a breaker as a small state machine sitting on the client side of every outbound call (or in a service mesh sidecar).

Closed (normal) — requests flow through. The breaker counts failures, slow responses, or specific error types (5xx, connection refused, deadline exceeded). Successes decay the failure window.
Open (tripped) — calls fail fast without hitting the network. You return a cached value, a degraded response, or a clear error ("payments temporarily unavailable"). The dependency gets breathing room to recover instead of absorbing a retry storm.
Half-open (probing) — after a cooldown timer, allow a trickle of real requests through. If they succeed, close the breaker; if they fail, reopen immediately.
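
The three states can be sketched as a small class. This is a minimal illustration, not any particular library's implementation: the names CircuitBreaker, CircuitOpenError, failure_threshold and cooldown_seconds are assumptions made for the example, failures are counted consecutively for simplicity, and the class is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the network."""


class CircuitBreaker:
    """Illustrative breaker: trips after N consecutive failures."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cooldown elapsed: let this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success closes the breaker and clears the failure count.
        self.state = State.CLOSED
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each outbound call, for example breaker.call(fetch_score, order), and catch CircuitOpenError to serve its fallback. A production version would also need locking and a limit on how many probes run at once.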

Netflix's Hystrix library popularized this pattern in microservices. Hystrix is now in maintenance mode, and equivalents live in Resilience4j, Polly (.NET), Envoy outlier detection, and cloud vendor SDKs. The names and details differ, but the core state machine is the same.

What should trip the breaker?

Naive implementations count only HTTP 500 responses. Production systems usually combine signals:

Consecutive failures — simple, but one flaky success resets the counter while latency is still awful.
Sliding window error rate — e.g. trip when more than 50% of the last 20 calls fail within 10 seconds. Smooths bursty noise.
Slow-call rate — treat responses over 2 s as failures even if they eventually return 200. Slow is failed for capacity planning.
Specific exceptions — trip on ECONNREFUSED and timeouts, but not on 404 from a user mistake.
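
As a rough sketch of how these signals might be combined, the helper below keeps a count-based window of recent calls and trips on either the failure rate or the slow-call rate. The class name SlidingWindowTrip, its parameters, and the choice of which exception types count as failures are assumptions for illustration. Time-based eviction (the "within 10 seconds" part) is left out to keep the example short.

```python
from collections import deque


class SlidingWindowTrip:
    """Count-based window over the most recent `size` calls (hypothetical helper)."""

    def __init__(self, size=20, failure_rate=0.5, slow_rate=0.5,
                 slow_seconds=2.0, min_calls=10):
        self.outcomes = deque(maxlen=size)  # one (failed, slow) pair per call
        self.failure_rate = failure_rate
        self.slow_rate = slow_rate
        self.slow_seconds = slow_seconds
        self.min_calls = min_calls

    @staticmethod
    def _counts_as_failure(error):
        # Connection and timeout errors point at a sick dependency.
        # Anything else (say, a 404 surfaced as LookupError) is ignored.
        return isinstance(error, (ConnectionError, TimeoutError))

    def record(self, duration_seconds, error=None):
        failed = error is not None and self._counts_as_failure(error)
        slow = duration_seconds > self.slow_seconds
        self.outcomes.append((failed, slow))

    def should_trip(self):
        total = len(self.outcomes)
        if total < self.min_calls:
            return False  # too few samples to judge
        failed = sum(1 for f, _ in self.outcomes if f)
        slow = sum(1 for _, s in self.outcomes if s)
        return (failed / total > self.failure_rate
                or slow / total > self.slow_rate)
```

The min_calls guard is there so that a single failed call right after startup does not read as a 100% error rate.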

Tune thresholds per dependency. A read-through cache miss to a catalog service can tolerate higher error rates than a call that moves money. Document what "open" means for each path — users prefer honest degradation over hung spinners.

Timeouts first, breaker second

A circuit breaker does not replace deadlines. Without per-request timeouts, threads block until the OS gives up — the breaker never gets failure signals in time to help. Set an aggressive client timeout shorter than the server's own timeout so you fail locally and release resources.

Pair timeouts with a concurrency limit on the caller: even when the breaker is closed, cap how many concurrent requests you send to a fragile dependency. A load balancer may shed load at the edge, but the breaker is the last line of defense inside your process before you become part of the problem.
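
A minimal sketch of both ideas, assuming hypothetical names throughout: client_timeout picks a deadline below both the caller's remaining budget and the server's own timeout, and ConcurrencyLimiter rejects calls once a fixed number are in flight instead of queueing them. The 0.8 margin is an arbitrary illustrative value.

```python
import threading


def client_timeout(remaining_budget_s, server_timeout_s, margin=0.8):
    """Deadline below the caller's remaining budget and the server's timeout."""
    return max(0.0, min(remaining_budget_s, server_timeout_s * margin))


class TooManyInFlight(Exception):
    """Raised when the cap on concurrent calls is already reached."""


class ConcurrencyLimiter:
    """Caps concurrent calls to one dependency and fails fast when full."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject right away rather than pile up waiters.
        if not self._slots.acquire(blocking=False):
            raise TooManyInFlight("dependency is at its concurrency cap")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

The computed timeout would be passed to whatever HTTP or RPC client makes the call. Rejecting instead of waiting keeps the caller's threads free, which is the point of the cap.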

Bulkheads: isolate blast radius

On a ship, watertight bulkheads keep one flooded compartment from sinking the vessel. In software, run different dependency classes on separate thread pools or connection pools. If the image-resize client trips a breaker, checkout threads should still be available for orders that do not need thumbnails.

Hystrix offered two isolation strategies: thread pool isolation, and semaphore isolation, which is lighter and runs on the caller's thread. Service meshes achieve similar separation with distinct circuit breaker budgets per route. The goal is identical: one bad neighbor cannot exhaust the shared pool everyone else needs.
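
Thread pool isolation can be sketched with one small executor per dependency class. The pool names, sizes, and the run_in_bulkhead helper are illustrative assumptions, not Hystrix's API.

```python
from concurrent.futures import ThreadPoolExecutor

# One small pool per dependency class (names and sizes are illustrative).
POOLS = {
    "checkout": ThreadPoolExecutor(max_workers=4, thread_name_prefix="checkout"),
    "image_resize": ThreadPoolExecutor(max_workers=2, thread_name_prefix="image"),
}


def run_in_bulkhead(pool_name, func, *args, timeout=1.0):
    """Run func on the named pool and wait at most `timeout` seconds for it."""
    future = POOLS[pool_name].submit(func, *args)
    # Raises a timeout error if the result is not ready in time.
    # The worker thread is not interrupted, so func needs its own deadline too.
    return future.result(timeout=timeout)
```

If every image_resize worker is stuck on a slow call, work submitted to the checkout pool still finds free threads. The cost is extra threads and a context switch per call, which is the overhead semaphore isolation avoids.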

This pairs naturally with the CAP trade-off mindset — during partial outages you choose availability of core features by sacrificing optional paths, not by letting optional paths take down core ones.

Retries vs circuit breakers

Retries heal transient blips; breakers stop chronic wounds from getting worse. Used together, they need discipline:

Retry only idempotent operations (GET, or POST with idempotency keys — see REST design).
Use exponential backoff with jitter so ten thousand clients do not retry in sync.
Decide how retries feed the breaker's counting window: either record one outcome per user request after its retries finish, or cap retry attempts, so one user request does not register as five failures.
When the breaker is open, do not retry — fail fast to the fallback.
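
These rules can be sketched as a small retry helper using the "full jitter" variant of exponential backoff, where each delay is drawn uniformly between zero and an exponentially growing cap. The names BreakerOpen, backoff_delays and call_with_retries are hypothetical, and the helper assumes the wrapped operation is idempotent.

```python
import random
import time


class BreakerOpen(Exception):
    """Hypothetical signal that a circuit breaker rejected the call."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_retries(func, max_attempts=3, base=0.1, cap=5.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Retry func on transient errors; never retry when the breaker is open."""
    delays = backoff_delays(max_attempts - 1, base, cap)
    for attempt in range(max_attempts):
        try:
            return func()
        except BreakerOpen:
            raise  # fail fast so the caller can use its fallback
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(delays[attempt])
```

Capping max_attempts bounds how many failures a single user request can contribute to the breaker's window.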

In event-driven systems, move retriable work to a queue with dead-letter handling instead of blocking HTTP retry loops. The breaker protects the synchronous hot path; the queue absorbs backlog asynchronously.

Fallbacks that are actually useful

"Return null" is not a fallback strategy. Good degradations include:

Stale cache — serve last-known-good catalog or exchange rate with a visible "prices may be delayed" banner.
Static defaults — feature flags that disable non-essential widgets when scoring or recommendations are down.
Queued acceptance — accept the order, return 202, fulfill when the dependency recovers (requires outbox or saga patterns).
Alternate provider — secondary RPC endpoint or read replica, with its own independent breaker so a double failure does not loop forever.
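
A sketch of a fallback chain that combines the alternate-provider and stale-cache ideas, with health tracked separately for each provider. EndpointHealth and fetch_with_fallback are hypothetical names, and the health tracker is deliberately simpler than a full breaker.

```python
import time


class EndpointHealth:
    """Per-endpoint tracker: unavailable after N consecutive failures."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = 0.0

    def available(self):
        if self.failures < self.threshold:
            return True
        # Tripped: allow traffic again only after the cooldown.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def fetch_with_fallback(providers, health, stale_cache, key):
    """Try providers in order, then the stale cache. Returns (value, source)."""
    for name, fetch in providers.items():
        if not health[name].available():
            continue  # skip endpoints whose own tracker is open
        try:
            value = fetch(key)
        except Exception:
            health[name].record(ok=False)
            continue
        health[name].record(ok=True)
        stale_cache[key] = value  # remember the last known good value
        return value, name
    if key in stale_cache:
        return stale_cache[key], "stale-cache"
    raise RuntimeError(f"no provider or cached value for {key!r}")
```

Returning the source alongside the value lets the caller decide whether to show a "prices may be delayed" banner.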

Crypto apps hitting Solana JSON-RPC see the same pattern: when the primary endpoint returns 429 or times out, a breaker opens on that host and traffic shifts to fallbacks — but only if each endpoint has isolated health tracking, not a single breaker for "the blockchain."

Half-open probes and thundering herds

When a breaker transitions from open to half-open, every instance in a fleet may probe at once — recreating the spike that tripped it. Mitigations:

Jittered cooldowns — stagger reopen timers per process.
Single-flight probing — only one goroutine/thread sends the probe; others wait for the result.
Gradual ramp — allow 1%, then 10%, then 100% of traffic after sustained success (similar to load balancer draining in reverse).
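
The first two mitigations can be sketched in a few lines. Both helpers are hypothetical, and the probe gate is a simplified variant: rather than having other threads wait for the probe's result, it tells them a probe is already running so they can fail fast to the fallback.

```python
import random
import threading


def jittered_cooldown(base_seconds, spread=0.2, rng=random.uniform):
    """Cooldown randomized within +/- spread so a fleet does not probe in lockstep."""
    return base_seconds * rng(1 - spread, 1 + spread)


class SingleFlightProbe:
    """Lets only one thread at a time send the half-open probe."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_probe(self, probe):
        """Return (True, result) if this thread probed, else (False, None)."""
        if not self._lock.acquire(blocking=False):
            return False, None  # another thread is already probing
        try:
            return True, probe()
        finally:
            self._lock.release()
```

Each process would compute its own jittered cooldown when the breaker opens, so reopen times spread out across the fleet.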

Observability matters: emit metrics for state transitions (breaker_open, breaker_half_open), failure rates, and fallback invocations. Alert on time spent open and on transition frequency, not just on individual open events: a breaker that flaps every thirty seconds indicates misconfigured thresholds or an unstable dependency.
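
As one possible shape for these metrics, the recorder below counts transitions and accumulates the time between opening and the next close. BreakerMetrics and its method names are assumptions for illustration. A real service would export the values through its metrics library rather than hold them in memory.

```python
import time
from collections import Counter


class BreakerMetrics:
    """Counts state transitions and total seconds spent not closed."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.transitions = Counter()  # e.g. {"breaker_open": 3}
        self.seconds_open = 0.0
        self._opened_at = None

    def on_transition(self, new_state):
        """new_state is one of "open", "half_open", "closed"."""
        now = self.clock()
        self.transitions[f"breaker_{new_state}"] += 1
        if new_state == "open" and self._opened_at is None:
            self._opened_at = now  # start of an outage window
        elif new_state == "closed" and self._opened_at is not None:
            # Half-open time counts as part of the outage window.
            self.seconds_open += now - self._opened_at
            self._opened_at = None
```

A high breaker_open count with a low seconds_open total suggests flapping, while the reverse suggests a long outage.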

Operational pitfalls

Shared breaker across unrelated calls — one bad query type opens the breaker for all database traffic. Scope per dependency and per operation class.
Fallbacks that call the same dependency — "fallback" that hits the same sick API through a different URL still fails.
Ignoring half-open success bias — three lucky probes during low traffic close the breaker before peak load returns.
Cascading opens — service A opens because B is down; C opens because A returns errors. Map dependency graphs and break cycles with async boundaries.
Testing only happy path — chaos experiments (kill a dependency, inject latency) should verify breakers trip and recover in staging with production-like concurrency.

Practical checklist

Every outbound network call has a deadline shorter than the SLA of the caller waiting on it.
Breakers are scoped per dependency + operation, with documented thresholds.
Fallback behavior is product-reviewed — not just "log and throw."
Retries use backoff + jitter and respect idempotency rules.
Metrics and dashboards show breaker state, not only error logs.
Chaos or fault-injection tests run before peak traffic seasons.