LLM agent circuit breaker and bulkhead resilience systems explained

Harbor Platform runs thousands of concurrent agent sessions across support, analytics, and internal ops. When its primary LLM provider entered a brownout (elevated latency, intermittent 503s, no hard outage page), the agent runtime kept retrying every completion and tool call under a single global retry policy. Worker threads piled up waiting on the same sick endpoint. Queue depth tripled in eight minutes, and healthy tenants could not start runs because the shared worker pool was saturated. In total, 89% of active sessions failed or timed out even though only one dependency was impaired. After engineering shipped per-dependency circuit breakers, bulkhead concurrency pools, and fast-fail routing to a fallback model, outage amplification (collateral session failures) fell to 4% and p95 time-to-first-token for unaffected tenants stayed within normal bounds.

Circuit breakers stop calling a dependency that is already failing, giving it time to recover while your agent fails fast or routes elsewhere. Bulkheads cap how many concurrent operations each dependency (or tenant class) may consume so one slow path cannot exhaust shared workers. Together they complement retry and backoff — retries help transient blips; breakers prevent retry storms from becoming platform-wide collapse. This guide covers breaker state machines, bulkhead pool design, agent-specific failure signals, coordination with rate limits and DLQs, Harbor Platform’s refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need breakers beyond generic microservices patterns

The general circuit breaker pattern applies to any HTTP service, but LLM agent stacks add dependency layers that each fail in their own way:

Completion providers — OpenAI-compatible APIs, self-hosted inference, regional replicas.
Embedding and reranker services — often separate rate limits and SLOs.
Tool backends — CRM, warehouse APIs, sandboxes, each with distinct timeout profiles.
Vector stores and memory — latency spikes that block retrieval steps before the model ever runs.
Long-running runs — a single agent session may chain dozens of calls; one sick tool poisons the whole trajectory.

A breaker keyed only to “the HTTP client” is too coarse. Instrument each dependency class with its own breaker and bulkhead. Coordinate with distributed tracing so open-breaker events appear as spans, not mysterious 30 s hangs.
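
One way to picture per-dependency keying is a small registry that hands out one breaker entry per (dependency class, name) pair. This is only a sketch: `BreakerRegistry`, `BreakerEntry`, and the key format are hypothetical names, and the entry holds just a state string rather than a full breaker.

```python
from dataclasses import dataclass


@dataclass
class BreakerEntry:
    key: str
    state: str = "closed"  # "closed", "open", or "half_open"


class BreakerRegistry:
    """Hands out one breaker entry per (dependency class, name) pair."""

    def __init__(self):
        self._entries = {}

    def get(self, dependency_class, name):
        key = f"{dependency_class}:{name}"
        if key not in self._entries:
            self._entries[key] = BreakerEntry(key)
        return self._entries[key]

    def open_keys(self):
        """Keys of all open breakers, e.g. for dashboards or planners."""
        return sorted(k for k, e in self._entries.items() if e.state == "open")


registry = BreakerRegistry()
registry.get("tool", "crm").state = "open"   # CRM is failing
registry.get("completion", "provider-a")     # still closed
print(registry.open_keys())                  # ['tool:crm']
```

Because the CRM tool and the completion provider have separate entries, an open CRM breaker says nothing about the model path.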

Circuit breaker state machine

A standard three-state breaker suits most agent gateways:

Closed: requests flow normally; failures increment a sliding-window counter.
Open: once a trip threshold is crossed (e.g. 50% errors over 30 s or 20 consecutive timeouts), fail fast without calling the dependency. Return a structured error the agent runtime can route to fallback or user messaging.
Half-open: after a cooldown (e.g. 30–120 s), allow a small probe batch (1–5 requests). In the strictest form, probe success closes the breaker and any probe failure reopens it; a looser policy closes on a majority of successful probes.
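
The three states can be sketched in a few dozen lines. The version below is a simplification under stated assumptions: it trips on consecutive failures rather than a sliding-window error rate, it closes on the first successful probe, and it is not thread-safe. The class name, parameters, and the injectable `clock` are illustrative and not taken from any particular library.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative three-state breaker; trips on consecutive failures."""

    def __init__(self, failure_threshold=20, cooldown_s=30.0, probe_limit=3,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.probe_limit = probe_limit
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.probes_admitted = 0

    def allow(self):
        """Return True if the caller may contact the dependency now."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_s:
                return False  # fail fast while cooling down
            self.state = State.HALF_OPEN
            self.probes_admitted = 0
        if self.state is State.HALF_OPEN:
            if self.probes_admitted >= self.probe_limit:
                return False  # probe batch already admitted
            self.probes_admitted += 1
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.state = State.CLOSED  # a successful probe closes the breaker
        self.failures = 0

    def record_failure(self):
        if self.state is State.OPEN:
            return  # late result from a call admitted before the trip
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0
```

Callers check `allow()` before each request and report the outcome with `record_success()` or `record_failure()`; when `allow()` returns False the runtime can route to fallback immediately.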

Agent-specific trip signals

Do not trip only on HTTP 5xx. For LLM providers, also consider:

Latency SLO breach: p95 time-to-first-token above 8 s for three consecutive windows.
Empty or truncated completions: the provider returns 200 with zero tokens or a cut-off response (degenerate responses).
Rate-limit saturation: sustained 429s that client-side throttling cannot absorb.
Tool-specific errors: a payment gateway hard decline and a network reset are different failure classes and should not feed the same breaker.
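
As a rough illustration, the signals above can be turned into a classifier that the breaker consumes. The sketch assumes a hypothetical `CallOutcome` record and signal names; the 8 s threshold and the three-window rule come from the example figures in the list, and deciding when 429s count as "sustained" is left to the breaker's own counter.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CallOutcome:
    status: int                 # HTTP status code from the provider
    completion_tokens: int = 0  # tokens actually returned


def trip_signal(outcome: CallOutcome) -> Optional[str]:
    """Name the failure signal for one call, or None if it looks healthy."""
    if outcome.status >= 500:
        return "server_error"
    if outcome.status == 429:
        return "rate_limited"
    if outcome.status == 200 and outcome.completion_tokens == 0:
        return "empty_completion"  # 200 OK but degenerate response
    return None


def latency_slo_breached(p95_by_window: Sequence[float],
                         slo_s: float = 8.0, windows: int = 3) -> bool:
    """True when the last `windows` p95 readings all exceed the SLO."""
    recent = list(p95_by_window)[-windows:]
    return len(recent) == windows and all(p > slo_s for p in recent)
```

Keeping per-call classification separate from the windowed latency check lets the latency trip fire even when every individual call eventually returns 200.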

Publish breaker state to the agent middleware layer so planners can skip tools whose breakers are open instead of burning context budget on doomed retries.

Bulkhead concurrency pools

Bulkheads limit parallel in-flight calls per pool. Typical agent topology:

Global completion pool — ceiling on simultaneous LLM requests across the fleet.
Per-provider sub-pools — so Provider A saturation does not block Provider B fallback traffic.
Per-tenant fair-share pools — align with tenant isolation; one noisy neighbor cannot exhaust workers.
Tool-class pools — separate caps for sandbox execution vs read-only CRM lookups.
Background vs interactive lanes — batch eval jobs get a dedicated low-priority bulkhead.

When a bulkhead is full, prefer structured rejection (queue with ETA, degrade to cheaper model, surface “try again”) over unbounded waiting. Unbounded waits look like agent hangs and trigger duplicate user submissions — a secondary failure mode worse than the original outage.
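
A bulkhead with a rejection path can be sketched with a bounded semaphore and a non-blocking acquire. The `Bulkhead` class, the `BulkheadFull` exception, and the optional `max_wait_s` parameter are hypothetical names for illustration; a production version would likely also report queue depth and an ETA.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a pool has no free slot; callers should degrade or shed."""


class Bulkhead:
    def __init__(self, name, max_concurrent, max_wait_s=0.0):
        self.name = name
        self.max_wait_s = max_wait_s
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        """Hold one slot for the duration of the with-block, or reject."""
        if self.max_wait_s > 0:
            acquired = self._slots.acquire(timeout=self.max_wait_s)  # bounded wait
        else:
            acquired = self._slots.acquire(blocking=False)           # never wait
        if not acquired:
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()
```

Wrapping each dependency call in `with pool.slot():` means a full pool surfaces as a `BulkheadFull` error the runtime can translate into "try again" or a cheaper model, instead of a thread that waits indefinitely.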

Coordinating breakers with retry, fallback, and DLQ

Resilience patterns stack; they must not fight each other:

Retry before the breaker counts: a single transient 503 should not trip the breaker; run classified retries with jitter first and let the breaker record the final outcome.
Open breaker skips retry: when the breaker is open, do not enqueue three more attempts; route immediately to the fallback tier or fail the step.
Fallback is not infinite: secondary providers need their own breakers; cascading fallback without caps recreates the storm.
DLQ for exhausted paths: runs that cannot complete after breaker and fallback handling should land in a dead-letter queue, with a breaker state snapshot attached for ops replay.
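
The fallback and DLQ rules can be sketched as a capped router. Everything here is an assumption for illustration: `route_with_fallback`, `dead_letter_record`, the `max_hops` cap, and the breaker interface (`allow()`, `record_success()`, `record_failure()`, and a `state` string) are hypothetical, and catching every exception is a shortcut a real router would replace with error classification.

```python
import json
import time


def route_with_fallback(providers, breakers, call, max_hops=2):
    """Try providers in order, skipping open breakers; return (name, result).

    Raises LookupError once the hop cap is reached or no provider is usable.
    """
    hops = 0
    for name in providers:
        if hops >= max_hops:
            break                       # fallback is capped, not infinite
        if not breakers[name].allow():
            continue                    # open breaker: skip, do not retry
        hops += 1
        try:
            result = call(name)
        except Exception:               # sketch only: classify errors in practice
            breakers[name].record_failure()
            continue
        breakers[name].record_success()
        return name, result
    raise LookupError("all providers exhausted")


def dead_letter_record(run_id, breakers, now=time.time):
    """Serialize a DLQ entry with a snapshot of every breaker's state."""
    return json.dumps({
        "run_id": run_id,
        "failed_at": now(),
        "breaker_states": {n: b.state for n, b in breakers.items()},
    })
```

When `route_with_fallback` raises, the caller can enqueue `dead_letter_record(...)` so that whoever replays the run can see which breakers were open at the time.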

Middleware ordering matters: rate limit → bulkhead acquire → breaker check → retry wrapper → actual call. Document this in your hook pipeline so new integrations do not bypass bulkheads.
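
The ordering can be made concrete with a single wrapper function. This is a sketch under assumptions: `guarded_call`, `Rejected`, and `TransientError` are hypothetical names, `bulkhead` is expected to behave like a `threading.Semaphore`, `breaker` is any object with `allow()`, `record_success()`, and `record_failure()`, and non-retryable errors simply propagate without being counted.

```python
import random
import time


class Rejected(Exception):
    """Request shed before reaching the dependency."""


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def guarded_call(call, *, allow_rate, bulkhead, breaker,
                 max_attempts=3, base_delay_s=0.2, sleep=time.sleep):
    if not allow_rate():                            # 1. rate limit
        raise Rejected("rate_limited")
    if not bulkhead.acquire(blocking=False):        # 2. bulkhead acquire
        raise Rejected("bulkhead_full")
    try:
        if not breaker.allow():                     # 3. breaker check
            raise Rejected("breaker_open")          # no retries when open
        for attempt in range(1, max_attempts + 1):  # 4. retry wrapper
            try:
                result = call()                     # 5. actual call
            except TransientError:
                if attempt == max_attempts:
                    breaker.record_failure()        # count only the final outcome
                    raise
                # Full jitter: random delay up to an exponentially growing cap.
                sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
            else:
                breaker.record_success()
                return result
    finally:
        bulkhead.release()                          # always free the slot
```

Because the breaker check sits outside the retry loop, an open breaker rejects the request before any attempt is made, and a call that succeeds on its second or third try never counts against the breaker.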

User-visible degradation tiers

When breakers open, agents should degrade predictably:

Tier 0 — transparent fallback to secondary model; user sees no change if quality holds.
Tier 1 — disable non-critical tools (web browse, heavy sandbox); answer from cache and memory only.
Tier 2 — template response explaining partial outage; offer retry button tied to half-open probe success.
Tier 3 — hard fail with ticket ID; run persisted for DLQ replay when dependency recovers.

Tie tier selection to breaker scope: an open CRM breaker should not trigger Tier 3 for the entire chat if the model and other tools remain healthy.
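
One possible mapping from open breakers to tiers is sketched below. The mapping itself is an assumption, not something the tiers above prescribe: breaker keys such as `completion:primary` and `tool:crm`, the `critical_tools` parameter, and the rule that a critical tool maps to Tier 2 are all illustrative choices.

```python
from enum import IntEnum


class Tier(IntEnum):
    TRANSPARENT_FALLBACK = 0
    REDUCED_TOOLS = 1
    PARTIAL_OUTAGE_NOTICE = 2
    HARD_FAIL = 3


def select_tier(open_breakers, critical_tools=frozenset(),
                primary="completion:primary", fallback="completion:fallback"):
    """Pick the mildest tier that still covers every open breaker."""
    open_set = set(open_breakers)
    models_open = open_set & {primary, fallback}
    tools_open = open_set - {primary, fallback}
    if models_open == {primary, fallback}:
        return Tier.HARD_FAIL              # no model left to answer
    if tools_open & set(critical_tools):
        return Tier.PARTIAL_OUTAGE_NOTICE  # a required tool is down
    if tools_open:
        return Tier.REDUCED_TOOLS          # drop optional tools only
    if primary in models_open:
        return Tier.TRANSPARENT_FALLBACK   # secondary model takes over
    return None                            # nothing relevant is open
```

With this mapping an open CRM breaker yields Tier 1 or Tier 2 depending on whether CRM is listed as critical, and never Tier 3 while a model is still reachable.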

Harbor Platform refactor (case study)

Before: one shared worker pool (512 threads), global retry policy (3 attempts, 2 s base backoff), no per-provider breakers, fallback model called only after full run timeout (120 s).

After:

Per-provider circuit breakers on completion, embedding, and top-10 tool integrations.
Bulkheads: 200 interactive + 80 batch completion slots; 40 slots reserved per premium tenant tier.
Half-open probes: 3 probe requests after each 60 s cooldown; the breaker closes automatically when at least 2 of the 3 succeed.
Breaker-open events trigger immediate fallback routing, not post-timeout.
Dashboards: breaker state, pool utilization, shed-request count per tenant.
Runbooks link open breakers to canary rollback when bad deploy correlates with error spikes.
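
The half-open probe rule in the list above (3 probes, close on 2 successes) can be expressed as a small policy object. This is a hedged sketch, not Harbor Platform's code: `ProbePolicy` and `probe_verdict` are hypothetical names, and reopening as soon as 2 successes become impossible is an assumed optimization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbePolicy:
    probes_per_cooldown: int = 3
    cooldown_s: float = 60.0
    successes_to_close: int = 2


def probe_verdict(results, policy=ProbePolicy()):
    """Return 'close', 'reopen', or 'pending' for a list of probe booleans."""
    successes = sum(1 for ok in results if ok)
    failures = len(results) - successes
    if successes >= policy.successes_to_close:
        return "close"
    # Reopen once the remaining probes can no longer reach the success quota.
    if failures > policy.probes_per_cooldown - policy.successes_to_close:
        return "reopen"
    return "pending"
```

Under this policy one failed probe out of three is tolerated, which avoids reopening the breaker on a single unlucky request during recovery.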

Results: during the next provider brownout, affected-session failure 89% → 11%; unaffected-tenant p95 latency within 8% of baseline; outage amplification (collateral session failures) 89% → 4%; mean time-to-recovery after provider heal 14 min → 2 min (half-open probes vs manual ops restart).

Technique decision table

Your context	Prefer	Avoid
Single provider, low traffic	Retry + timeout; simple error budget alert	Complex bulkhead matrix
Multi-tenant SaaS agents	Per-tenant bulkheads + per-provider breakers	One global pool
Tool-heavy workflows	Breaker per tool integration class	Breaker only on LLM HTTP client
Batch eval / offline jobs	Separate bulkhead; pause when breaker opens	Sharing interactive pool
Regulated workloads	Fail closed to DLQ with audit trail	Silent cross-region fallback without logging

Common pitfalls

Breaker too sensitive: trips on a single timeout, then flaps open and closed and blocks healthy traffic.
Breaker too slow: waits for a 50% error rate while the queue is already dead; add latency trips too.
Shared breaker across heterogeneous tools: one bad sandbox takes down CRM reads.
Retry ignores open breaker: a middleware ordering bug replays requests into a known-down API.
No half-open probes: ops manually restart services long after the provider has recovered.
Bulkhead without rejection path: threads block forever, which is worse than a fast fail.
Fallback without breaker: the secondary provider is overloaded by a 100% traffic shift.

Production checklist

Define breaker per dependency class (completion, embed, each critical tool family).
Configure sliding-window error and latency thresholds with documented rationale.
Implement closed / open / half-open FSM with probe batch size and cooldown.
Ship bulkhead pools for interactive, batch, and per-tenant fair-share lanes.
Order middleware: rate limit, bulkhead, breaker, then retry wrapper.
Wire open-breaker events to model fallback and degradation tiers.
Expose breaker state and pool utilization on dashboards and traces.
Route exhausted runs to DLQ with breaker snapshot for replay.
Load-test breaker behavior: verify fail-fast under synthetic 503 storm.
Run game days: provider brownout drill without disabling breakers.

Key takeaways

Retries alone amplify outages — breakers stop retry storms.
Bulkheads protect the fleet from one slow dependency consuming all workers.
Agent stacks need per-layer breakers — model, embed, tools, vector store.
Half-open probes restore traffic automatically when dependencies heal.
Harbor Platform cut collateral failures 89% → 4% with per-provider breakers and bulkhead pools.