Circuit breaker pattern explained: fail fast, recovery, and resilience
In a microservices stack, one slow dependency can stall every thread in every caller. Your checkout service waits on a recommendations API that is timing out; thread pools fill; the database connection queue backs up; the whole site returns 503 even though the payment processor is healthy. That is a cascading failure, and it is far more common than a hard crash. The circuit breaker pattern, popularized in production by Netflix Hystrix and now available in tools ranging from the Resilience4j library to the Istio service mesh, stops calling a failing dependency and fails fast instead. After a cooling-off period it probes for recovery. This guide explains the three breaker states, how to tune trip thresholds, what fallbacks are worth building, and how breakers fit alongside timeouts, retries, bulkheads, and observability so your system degrades gracefully instead of collapsing all at once.
The problem breakers solve
Synchronous HTTP calls between services look innocent: service A calls service B and waits. When B is healthy, latency is a few milliseconds. When B is overloaded or its database is wedged, each call might hang for 30 seconds until a client timeout fires. If A has 200 worker threads and 200 concurrent requests are stuck waiting on B, A cannot serve anyone else — including users who never needed B at all.
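A back-of-the-envelope calculation makes the capacity drop concrete. The sketch below reuses the figures from this section (200 worker threads, a 30-second hang) and assumes a healthy latency of 5 ms, which is an illustrative number rather than a measurement.

```python
# Rough upper bound: each thread completes 1 / latency requests per second.
WORKER_THREADS = 200        # from the example above
HEALTHY_LATENCY_S = 0.005   # assumption: "a few milliseconds" taken as 5 ms
HUNG_LATENCY_S = 30.0       # each call hangs until the client timeout fires


def max_throughput(threads: int, latency_s: float) -> float:
    """Requests per second if every request holds one thread for latency_s."""
    return threads / latency_s


healthy_rps = max_throughput(WORKER_THREADS, HEALTHY_LATENCY_S)
degraded_rps = max_throughput(WORKER_THREADS, HUNG_LATENCY_S)

print(f"healthy:  {healthy_rps:,.0f} req/s")
print(f"degraded: {degraded_rps:.1f} req/s")
```

Under these assumptions the same pool goes from roughly 40,000 requests per second to fewer than 7, which is why requests that never touch B still queue behind the stuck ones.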
Retries make this worse without guardrails. A client that retries three times on timeout triples load on an already struggling dependency. The classic fix stack is:
- Timeouts — cap how long you wait (necessary but not sufficient).
- Retries with backoff — recover from transient blips, not sustained outages.
- Circuit breakers — stop calling a dependency that is clearly unhealthy.
- Bulkheads — isolate thread pools so one dependency cannot exhaust all capacity.
Think of a breaker like an electrical circuit breaker in your house: when current spikes, it trips open so the wiring does not catch fire. Software breakers trip on error rate or slow-call rate, not amperage — but the goal is the same: contain damage and give the system time to recover.
Three states: closed, open, half-open
Every circuit breaker implementation revolves around the same state machine:
Closed (normal operation)
Requests pass through to the downstream service. The breaker records successes and failures (and often latency). While the failure rate stays below a configured threshold, nothing changes — this is the default steady state.
Open (tripped)
When failures exceed the threshold within a sliding window — for example, 50% of the last 20 calls failed, or 5 consecutive timeouts — the breaker opens. New calls fail immediately without hitting the network. The caller gets a fast error or a cached fallback response. A timer starts; after a wait duration (often 30–60 seconds) the breaker moves to half-open.
Half-open (probing)
A limited number of trial requests are allowed through, typically one or a small percentage. If they succeed, the breaker closes and normal traffic resumes. If the probes fail, the breaker opens again, sometimes with a longer wait. Half-open is how the system detects recovery without flooding a fragile dependency the moment it blinks online.
Libraries expose these states differently: Resilience4j uses CLOSED, OPEN, HALF_OPEN; Envoy and Istio surface outlier detection and passive health checks with similar semantics. The mental model transfers everywhere.
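The state machine above can be sketched in a few dozen lines. This is a minimal illustration, not any library's implementation: it trips on consecutive failures (the simpler of the two trigger styles mentioned), allows a single probe in half-open, and is not thread-safe. The class and parameter names (CircuitBreaker, failure_threshold, wait_seconds, clock) are hypothetical.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, wait_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.wait_seconds = wait_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._consecutive_failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.wait_seconds:
                raise CircuitOpenError("breaker is open; failing fast")
            self.state = State.HALF_OPEN  # wait elapsed: let one probe through
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        # A success in closed or half-open resets the count and closes the breaker.
        self._consecutive_failures = 0
        self.state = State.CLOSED

    def _on_failure(self) -> None:
        self._consecutive_failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if (self.state is State.HALF_OPEN
                or self._consecutive_failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller wraps each outbound request, for example `breaker.call(lambda: client.get(url))`, and treats CircuitOpenError as the signal to use a fallback. A production version would add locking and a rate-based window.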
What counts as a failure?
Tuning starts with defining which outcomes trip the breaker:
- HTTP 5xx from the dependency — clear server-side failure.
- Timeouts — no response within your deadline; often the most damaging in practice.
- Connection refused / DNS failure — dependency is down or misconfigured.
- Slow calls — some libraries trip when latency exceeds a percentile threshold even if the call eventually succeeds; useful when thread exhaustion is the risk.
Do not trip on expected client errors. A 404 from a user lookup service is not a breaker event — it is a valid business outcome. A 429 rate limit from a partner API is trickier: tripping protects you, but you may want rate limiting on your side first and backoff instead of opening the breaker for quota exhaustion.
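These rules can be captured in one small classification function that the breaker consults after every call. The sketch below is an assumption about how you might encode them; the function name and the trip_on_429 flag are hypothetical, and it relies on the fact that Python's built-in timeout, connection, and DNS lookup errors are all subclasses of OSError.

```python
from typing import Optional


def counts_as_failure(status: Optional[int] = None,
                      error: Optional[BaseException] = None,
                      trip_on_429: bool = False) -> bool:
    """Return True if this outcome should count against the breaker."""
    if error is not None:
        # TimeoutError, ConnectionRefusedError, and socket.gaierror (DNS)
        # all derive from OSError; other exceptions are treated as bugs
        # in our own code, not as dependency failures.
        return isinstance(error, OSError)
    if status is None:
        return False
    if status == 429:
        # Quota exhaustion: off by default, prefer backoff or local rate limiting.
        return trip_on_429
    # 5xx is a server-side failure; 4xx is a valid business outcome.
    return 500 <= status <= 599
```

Keeping the decision in one function makes the policy easy to unit-test and easy to vary per dependency.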
Minimum call volume matters. If you trip after “50% failures in the last 10 calls” but only 3 calls happened, noise dominates signal. Most libraries require a minimum number of calls in the window before evaluating the rate — set this high enough to avoid flapping on low traffic endpoints.
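The minimum-volume guard is easy to see in code. The following is a minimal count-based sliding window, assuming a fixed window size and a failure-rate threshold; FailureRateWindow and its parameters are hypothetical names, not a specific library's API.

```python
from collections import deque
from typing import Deque


class FailureRateWindow:
    def __init__(self, size: int = 10, min_calls: int = 10,
                 threshold: float = 0.5) -> None:
        if min_calls > size:
            raise ValueError("min_calls cannot exceed the window size")
        self.min_calls = min_calls
        self.threshold = threshold
        # deque with maxlen drops the oldest outcome automatically.
        self._outcomes: Deque[bool] = deque(maxlen=size)

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_trip(self) -> bool:
        # Too few samples: do not evaluate the rate at all.
        if len(self._outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.threshold
```

With min_calls set to 10, three failures in a row on a quiet endpoint leave the breaker closed; the rate is only evaluated once ten outcomes are recorded.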
Fallbacks and degraded behavior
Opening the breaker without a plan still returns errors — just faster ones. Good fallbacks match user expectations:
- Cached stale data — show yesterday's product recommendations instead of none; label it as cached if regulations require freshness disclosure.
- Static defaults — a generic hero banner when personalization is down.
- Queue for later — accept the order, enqueue a message to fulfill when the dependency recovers; requires idempotent consumers.
- Fail the right operation — block checkout if payment is down; do not block browsing if reviews are down.
Fallback code must be simple and independently tested. A fallback that calls another flaky service just moves the cascade. Fallbacks should also have their own timeouts — reading from a local cache should not block on a lock held by a crashed thread.
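As an illustration of the first two fallback styles, the sketch below tries the live call, falls back to the last good value for that user, and finally to a static default. It is deliberately simple: an in-memory dict, no eviction, no locking. The class name and the "source" label are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Recommendations = List[str]


class RecommendationsWithFallback:
    def __init__(self, fetch: Callable[[str], Recommendations],
                 default: Recommendations) -> None:
        self._fetch = fetch      # the real call, ideally wrapped by a breaker
        self._default = default  # static default; never touches the network
        self._last_good: Dict[str, Recommendations] = {}

    def get(self, user_id: str) -> Tuple[Recommendations, str]:
        """Return (items, source); source is 'live', 'cached', or 'default'."""
        try:
            items = self._fetch(user_id)
        except Exception:
            cached = self._last_good.get(user_id)
            if cached is not None:
                return cached, "cached"  # stale but personalized
            return self._default, "default"
        self._last_good[user_id] = items
        return items, "live"
```

Returning the source alongside the data lets the caller label stale results or count fallback invocations, and the fallback path itself only reads local memory.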
Pairing with timeouts, retries, and bulkheads
Breakers are one layer, not the whole resilience story:
Timeouts first
A breaker that never sees failures because calls hang forever never trips. Set aggressive client timeouts slightly below upstream gateway limits. For RPC-heavy services (blockchain indexers, payment gateways), timeouts are often the first signal of saturation — our settlement stack rotates RPC endpoints when rate limits appear rather than retrying into a 429 storm.
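One way to keep client timeouts below the gateway limit is to derive them from it. The helper below is a hypothetical sketch: it subtracts a safety margin and any planned backoff time from the gateway timeout, then splits what remains across the attempts you intend to make.

```python
def client_timeout(gateway_timeout_s: float, attempts: int = 1,
                   margin_s: float = 0.5, backoff_total_s: float = 0.0) -> float:
    """Per-attempt timeout that lets all attempts finish before the gateway gives up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    budget = gateway_timeout_s - margin_s - backoff_total_s
    if budget <= 0:
        raise ValueError("no time left for the call itself")
    return budget / attempts
```

For example, with a 10-second gateway limit, a 1-second margin, and three attempts, each attempt gets 3 seconds. The margin and attempt count are assumptions to tune per dependency.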
Retries inside the closed state only
Retry transient errors (502, connection reset) with exponential backoff and jitter while the breaker is closed. Once open, retries are pointless — fail fast to the fallback. Never retry without idempotency on mutating operations.
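A retry helper that respects the breaker might look like the sketch below. It assumes the caller supplies a breaker_is_open function and that transient failures surface as ConnectionError or TimeoutError; mapping an HTTP 502 to one of those exceptions is left to the HTTP client wrapper. All names here are hypothetical.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of retrying when the breaker is open."""


def retry_with_backoff(
    func: Callable[[], T],
    breaker_is_open: Callable[[], bool],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        # Check before every attempt: the breaker may open between retries.
        if breaker_is_open():
            raise CircuitOpenError("breaker open; skipping retries")
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
    raise AssertionError("unreachable")
```

Only wrap idempotent operations with this helper. The sleep and rng parameters exist so tests can run without real delays.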
Bulkheads isolate blast radius
A bulkhead dedicates a small thread pool or connection limit per dependency. If the recommendations client exhausts its 20 threads, checkout still has 180 threads for payment. Breakers stop new work; bulkheads cap concurrent work. Use both on critical paths.
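A concurrency-limit bulkhead can be sketched with a semaphore. This version rejects immediately when the limit is reached instead of queuing, which is one reasonable choice among several; Bulkhead and BulkheadFullError are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the dependency already has max_concurrent calls in flight."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], T]) -> T:
        # Non-blocking acquire: shed load rather than pile up waiting threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many concurrent calls")
        try:
            return func()
        finally:
            self._slots.release()
```

With `Bulkhead(max_concurrent=20)` around the recommendations client, a stall there can occupy at most 20 threads, leaving the rest of the pool for payment.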
Load balancers and health checks
Load balancers remove unhealthy instances from rotation — a coarser, infrastructure-level breaker. Application breakers react faster to dependency-specific failures (one bad endpoint behind a healthy load balancer) and can carry business-aware fallbacks load balancers cannot.
Where to put breakers
Place breakers at integration boundaries — every outbound call to another team's service, a third-party API, or a shared database read replica that can lag. Common placements:
- HTTP client wrappers (OkHttp interceptors, axios middleware, fetch wrappers).
- Service mesh sidecars (Envoy, Linkerd) — uniform policy without per-language libraries.
- API gateway routes to external partners.
- Database connection pools for optional read paths (not for the primary write DB on the critical path unless you have a replica fallback).
One breaker per logical dependency, not per URL path — unless paths have wildly different failure profiles. Sharing a breaker across unrelated APIs means a broken analytics beacon can block login.
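One simple way to get a breaker per logical dependency is to key a registry by host rather than by full URL. The sketch below uses a placeholder Breaker dataclass to stay self-contained; in practice the registry would hold real breaker objects, and keying by host is an assumption that fits only when one host maps to one dependency.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlsplit


@dataclass
class Breaker:
    """Placeholder for a real breaker; only identity and state matter here."""
    name: str
    state: str = "closed"


class BreakerRegistry:
    def __init__(self) -> None:
        self._breakers: Dict[str, Breaker] = {}

    def for_url(self, url: str) -> Breaker:
        host = urlsplit(url).hostname
        if host is None:
            raise ValueError(f"cannot determine host for {url!r}")
        # All paths on the same host share one breaker.
        if host not in self._breakers:
            self._breakers[host] = Breaker(name=host)
        return self._breakers[host]
```

Here an open breaker for the analytics host leaves the breaker for the auth host untouched, while `/login` and `/token` on the auth host share state.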
Observability and operations
Breakers you cannot see are breakers you forget exist. Export metrics:
- State transitions (closed to open, open to half-open) — alert on open.
- Call volume allowed vs rejected while open.
- Failure rate in the sliding window.
- Fallback invocation count — high fallback rate means user-visible degradation even if HTTP 200.
Log state changes with the dependency name and the threshold that tripped. During incidents, operators may manually force breakers open to stop retry storms, or force them closed or half-open to test recovery; document runbooks for each override. Pair breaker dashboards with distributed traces so you can answer “did checkout fail because payment opened or because inventory opened?”
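The metrics listed above can be collected with a small recorder that the breaker calls on each event. This sketch keeps counters in memory and logs transitions; a real system would export the same numbers to its metrics backend. The class, method, and counter names are hypothetical.

```python
import logging
from collections import Counter

log = logging.getLogger("circuit_breaker")


class BreakerMetrics:
    """In-memory counters for one dependency's breaker."""

    def __init__(self, dependency: str) -> None:
        self.dependency = dependency
        self.counters: Counter = Counter()

    def on_transition(self, old: str, new: str, reason: str) -> None:
        self.counters[f"transition.{old}_to_{new}"] += 1
        # Log the dependency name and the threshold that tripped.
        log.warning("breaker %s: %s -> %s (%s)", self.dependency, old, new, reason)

    def on_call(self, allowed: bool) -> None:
        self.counters["calls.allowed" if allowed else "calls.rejected"] += 1

    def on_fallback(self) -> None:
        self.counters["fallback.invoked"] += 1

    def rejected_ratio(self) -> float:
        total = self.counters["calls.allowed"] + self.counters["calls.rejected"]
        return self.counters["calls.rejected"] / total if total else 0.0
```

An alert on the closed-to-open transition counter covers the "alert on open" rule, and the fallback counter exposes degradation that HTTP status codes hide.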
Anti-patterns
- Breaker on everything including localhost — adds complexity without protecting real boundaries.
- Thresholds copied from a blog post — 50% / 10 calls is a starting point; tune per dependency SLO and traffic volume.
- Opening on 4xx business errors — trips during normal “user not found” flows.
- Fallback that blocks — synchronous fallback to another HTTP call defeats the purpose.
- No half-open probes — manual redeploys to reset breakers do not scale.
- Breaker without timeout — threads still exhaust waiting; the breaker never gets failure signals in time.
Production checklist
- List outbound dependencies ranked by blast radius if they fail.
- Set client timeouts per dependency; verify they fire before gateway timeouts.
- Wrap high-risk calls with a breaker: failure-rate threshold, minimum volume, wait duration, half-open probe count.
- Define fallbacks for user-visible paths; return explicit degraded-state headers or UI copy where honesty matters.
- Add bulkhead limits on thread pools shared across dependencies.
- Export breaker state metrics and alert when any breaker stays open beyond N minutes.
- Load-test with dependency failure injection (chaos tools or toxiproxy) before production traffic discovers gaps.
- Document manual override procedures for on-call — force open during partner incidents, force half-open to test recovery.
Key takeaways
- Circuit breakers stop calling unhealthy dependencies and fail fast instead of queuing unbounded timeouts.
- Closed, open, and half-open states control normal traffic, outage isolation, and recovery probing.
- Trip on timeouts and 5xx with minimum call volume; do not trip on expected 4xx business responses.
- Fallbacks should be simple and independently tested; some can queue work for async recovery.
- Combine breakers with timeouts, limited retries, bulkheads, and metrics — no single pattern is enough.
Related reading
- Microservices architecture explained — service boundaries and where resilience patterns belong
- Load balancing explained — health checks and infrastructure-level traffic shedding
- Observability explained — metrics and alerts for breaker state transitions
- Message queues explained — async fallbacks when synchronous paths are open