
LLM retry, fallback and resilience explained

Harbor Legal's contract-review pipeline failed quietly for eleven minutes on a Tuesday afternoon: Anthropic returned HTTP 529 (overloaded) on every clause-extraction call, the SDK retried three times with fixed 1-second delays, and the queue backed up to 240 documents. No circuit opened, no secondary model engaged, and attorneys saw only a spinning “Analyzing…” badge. The refactor introduced classified error handling (429/5xx vs 400/401), exponential backoff with full jitter capped at 32 seconds, a two-tier fallback chain (frontier → mid-tier → cached template), and a circuit breaker that shed load after five consecutive provider failures. P99 completion time during the next regional outage dropped from 11 minutes to 47 seconds with zero duplicate billable extractions.

Resilience for LLM applications is not the same as generic HTTP retry logic. Completions are expensive, slow, often non-idempotent, and may stream partial tokens before failure. This guide covers error taxonomy, backoff and jitter, idempotency keys, timeout budgets, model fallback chains, circuit breakers, ties to model routing and observability, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.

Error taxonomy: what to retry vs fail fast

Blind retry-on-any-error amplifies outages. Classify provider responses before scheduling another attempt:

| Category | Examples | Retry? | Notes |
| --- | --- | --- | --- |
| Transient capacity | 429, 529, 503, connect timeout | Yes, with backoff | Honor Retry-After when present; cap total wait. |
| Transient server | 500, 502, 504, empty body | Yes, limited | May indicate partial generation; check idempotency. |
| Client / schema | 400, 422, invalid JSON schema | No | Fix prompt or schema; retry wastes quota. |
| Auth / policy | 401, 403, content filter block | No | Route to human or safe template; do not loop. |
| Context limit | 413, context_length_exceeded | No (retry after fix) | Truncate, compress, or swap model; do not resend the same payload. |

Log the error class on every attempt. Dashboards that lump all failures into “LLM error rate” hide whether you need more capacity, better prompts, or a fallback model.
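
The taxonomy above can be sketched as a small classifier that runs before any retry is scheduled. This is a minimal illustration, not a provider SDK: the names RetryClass, classify and should_retry are hypothetical, and treating unrecognized statuses as non-retryable is an assumption you may want to revisit.

```python
from enum import Enum
from typing import Optional


class RetryClass(Enum):
    TRANSIENT_CAPACITY = "transient_capacity"
    TRANSIENT_SERVER = "transient_server"
    CLIENT_SCHEMA = "client_schema"
    AUTH_POLICY = "auth_policy"
    CONTEXT_LIMIT = "context_limit"
    UNKNOWN = "unknown"


# Hypothetical mapping that mirrors the taxonomy table.
_STATUS_TO_CLASS = {
    429: RetryClass.TRANSIENT_CAPACITY,
    529: RetryClass.TRANSIENT_CAPACITY,
    503: RetryClass.TRANSIENT_CAPACITY,
    500: RetryClass.TRANSIENT_SERVER,
    502: RetryClass.TRANSIENT_SERVER,
    504: RetryClass.TRANSIENT_SERVER,
    400: RetryClass.CLIENT_SCHEMA,
    422: RetryClass.CLIENT_SCHEMA,
    401: RetryClass.AUTH_POLICY,
    403: RetryClass.AUTH_POLICY,
    413: RetryClass.CONTEXT_LIMIT,
}

_RETRYABLE = {RetryClass.TRANSIENT_CAPACITY, RetryClass.TRANSIENT_SERVER}


def classify(status: Optional[int], error_code: str = "") -> RetryClass:
    """Map an HTTP status (None = no response, e.g. connect timeout) to a class."""
    # A context-limit error code wins over the status it arrived with.
    if error_code == "context_length_exceeded":
        return RetryClass.CONTEXT_LIMIT
    if status is None:
        return RetryClass.TRANSIENT_CAPACITY
    # Assumption: anything unrecognized fails fast rather than looping.
    return _STATUS_TO_CLASS.get(status, RetryClass.UNKNOWN)


def should_retry(retry_class: RetryClass) -> bool:
    return retry_class in _RETRYABLE
```

Logging `classify(...).value` on every attempt gives dashboards the per-class split described above.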

Backoff, jitter and retry budgets

Fixed-interval retries synchronize across clients and deepen provider overload — the classic thundering herd. Use exponential backoff with full jitter: sleep random(0, min(cap, base * 2^attempt)). A base of 500 ms and cap of 32 s is a common starting point for chat APIs.
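
A minimal sketch of the full-jitter formula, using the 500 ms base and 32 s cap mentioned above. The function name and the retry_after parameter are illustrative; treating a Retry-After value as a floor on the delay is an assumption about how you want to combine the two.

```python
import random
from typing import Optional


def full_jitter_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 32.0,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * 2 ** attempt)
    # Full jitter: pick uniformly from the whole interval [0, ceiling].
    delay = (rng or random).uniform(0.0, ceiling)
    if retry_after is not None:
        # Assumption: never retry sooner than the provider asked for.
        delay = max(delay, retry_after)
    return delay
```

With these defaults the ceiling doubles per attempt (0.5 s, 1 s, 2 s, ...) until it reaches the 32 s cap at attempt 6.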

Pair backoff with a retry budget: at most 3–5 attempts per user-visible request, and a global budget per minute so one bad deploy cannot retry the entire fleet into bankruptcy. Track retry-attributed token spend separately in cost dashboards.
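
One way to express both limits is a per-request attempt cap combined with a rolling-window budget. The sketch below is process-local and the class and function names are hypothetical; a fleet-wide budget would need a shared store, which is not shown.

```python
import time
from collections import deque
from typing import Callable, Deque


class RetryBudget:
    """Caps the number of retries allowed within a rolling time window."""

    def __init__(
        self,
        max_retries: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_retries = max_retries
        self.window_s = window_s
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Drop retries that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_retries:
            return False
        self._stamps.append(now)
        return True


def may_retry(attempts_made: int, budget: RetryBudget, max_attempts: int = 4) -> bool:
    """Allow a retry only if both the per-request and the global budget permit it."""
    return attempts_made < max_attempts and budget.try_acquire()
```

Checking the per-request cap first means an exhausted request never consumes a slot from the shared budget.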

For batched or async jobs, push retries to a queue with visibility timeout rather than blocking the HTTP handler. Dead-letter the message after max attempts and surface it in an operator inbox — silent loss is worse than a visible failure.
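
The requeue-then-dead-letter flow can be illustrated with an in-memory stand-in. A real queue service would supply the visibility timeout and delayed redelivery; here Job, drain and the list used as a dead-letter store are all hypothetical, and a production consumer would requeue only errors classified as retryable rather than every exception.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Job:
    job_id: str
    payload: str
    attempts: int = 0


def drain(
    queue: Deque[Job],
    dead_letter: List[Job],
    handler: Callable[[Job], None],
    max_attempts: int = 3,
) -> None:
    """Process jobs until the queue is empty, dead-lettering repeat failures."""
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts >= max_attempts:
                # Keep the failed job visible to operators instead of dropping it.
                dead_letter.append(job)
            else:
                # Back of the queue; a real queue would redeliver after a delay.
                queue.append(job)
```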

Idempotency: avoid double answers and double bills

LLM calls are not naturally idempotent. A 504 after the provider accepted the request may still complete and bill. Mitigations:

  • Idempotency keys — pass a stable key per logical operation (e.g. contract_id + clause_index + prompt_version). Providers that support idempotency return the same completion on replay.
  • Outbox pattern — persist “requested” before calling the API; on retry, check whether a completion already landed.
  • Streaming checkpoints — if a stream dies mid-flight, decide explicitly: resume (hard), discard partial tokens, or return partial output with an “incomplete” flag. Never blindly restart without deduplicating visible output.
  • Tool side effects — agent tools that send email or charge cards must use their own idempotency keys; LLM retry must not re-fire them.
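
The first two mitigations can be sketched together: derive a stable key from the logical operation, record "requested" before calling the model, and return the stored completion on replay. The Outbox class below is an in-memory stand-in for a durable table, and all names are hypothetical. It does not cover the case where the provider finished but the response was lost; that still needs provider-side idempotency or reconciliation.

```python
import hashlib
from typing import Callable, Dict


def idempotency_key(contract_id: str, clause_index: int, prompt_version: str) -> str:
    """Stable key for one logical operation; same inputs give the same key."""
    raw = f"{contract_id}:{clause_index}:{prompt_version}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Outbox:
    """In-memory stand-in for a durable table keyed by idempotency key."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def complete_once(self, key: str, call_model: Callable[[], str]) -> str:
        row = self._rows.get(key)
        if row is not None and row["status"] == "done":
            # Replay: reuse the stored completion, no second billable call.
            return row["completion"]
        # Persist intent before the call so a crash leaves a visible trace.
        self._rows[key] = {"status": "requested", "completion": None}
        completion = call_model()
        self._rows[key] = {"status": "done", "completion": completion}
        return completion
```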

Timeout budgets and end-to-end deadlines

A single LLM call can exceed user patience. Allocate a deadline tree from the outer request downward:

  • Browser / client timeout (e.g. 60 s for chat)
  • API gateway deadline (55 s)
  • Per-model attempt timeout (20 s first try, 15 s fallback)
  • Retrieval + rerank sub-calls (5 s each)

Subtract elapsed time before each retry so you never start a 20-second generation with only 2 seconds left on the parent deadline. When the budget expires, return a degraded response (cached summary, smaller model, human queue) rather than hanging.
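
A small deadline object makes the subtraction explicit. This is a sketch under assumptions: the Deadline class and the min_useful_s threshold (below which an attempt is not worth starting) are illustrative and not part of any SDK.

```python
import time
from typing import Callable, Optional


class Deadline:
    """Tracks one end-to-end budget shared by every attempt and sub-call."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def attempt_timeout(self, desired_s: float, min_useful_s: float = 3.0) -> Optional[float]:
        """Timeout for the next attempt, or None if too little budget is left."""
        left = self.remaining()
        if left < min_useful_s:
            # Caller should return the degraded response instead of trying again.
            return None
        return min(desired_s, left)
```

For example, with a 55 s gateway deadline the first attempt gets its full 20 s, while a fallback requested with only 10 s left is clipped to 10 s rather than the nominal 15 s.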

Model fallback chains and provider diversity

A fallback chain tries alternative models or providers when the primary is unavailable or over latency SLO. Design chains deliberately:

  1. Quality tier — frontier model for first attempt; mid-tier for capacity failures; small model for last-resort summaries.
  2. Capability match — fallback must support the same modality (vision, JSON mode, context length). Dropping from 128k to 8k context without truncating input fails again.
  3. Prompt compatibility — version prompts per model family; do not paste Claude-specific XML into a GPT endpoint.
  4. Cost ceiling — document that fallback tier may reduce quality; optionally ask user consent before downgrade on high-stakes flows.

Cross-provider diversity (same model class on two hosts) beats single-vendor multi-region when the outage is API-wide. Keep credentials and rate-limit pools isolated so fallback traffic does not starve the primary.
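
A fallback chain with capability checks might look like the sketch below. Tier, run_chain and AllTiersFailed are hypothetical names, each tier's call function stands in for a real provider client, and a production version would fall through only on errors classified as retryable rather than on every exception.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    max_context_tokens: int
    supports_json_mode: bool
    call: Callable[[str], str]  # raises on provider failure


class AllTiersFailed(Exception):
    pass


def run_chain(
    tiers: List[Tier], prompt: str, prompt_tokens: int, need_json: bool
) -> Tuple[str, str]:
    """Return (tier_name, completion) from the first capable tier that succeeds."""
    errors = []
    for tier in tiers:
        # Capability match: skip tiers that would fail again on the same payload.
        if prompt_tokens > tier.max_context_tokens:
            errors.append(f"{tier.name}: context too small")
            continue
        if need_json and not tier.supports_json_mode:
            errors.append(f"{tier.name}: no JSON mode")
            continue
        try:
            return tier.name, tier.call(prompt)
        except Exception as exc:
            errors.append(f"{tier.name}: {exc}")
    raise AllTiersFailed("; ".join(errors))
```

Returning the tier name alongside the completion lets the caller record, and where appropriate disclose, which tier produced the answer.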

Circuit breakers and load shedding

When error rate or latency exceeds a threshold, a circuit breaker stops calling the failing dependency and fails fast for a cool-down window. States:

  • Closed — normal traffic; count failures in a rolling window.
  • Open — reject or reroute immediately; no calls hit the sick provider.
  • Half-open — allow a probe request; close on success, re-open on failure.

For LLM services, open the circuit on consecutive 529/503 responses or when p95 latency exceeds 2× baseline for 60 seconds. While open, route to the fallback chain or return a cached “try again shortly” response. Pair with canary gates so a bad model version trips the breaker before full traffic absorbs it.
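
The three states can be sketched as a small class. This version is deliberately simplified: it counts consecutive failures only (no rolling window or latency trigger), the class and method names are hypothetical, and the default threshold and cool-down are placeholders to tune.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 45.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and not self._probe_in_flight:
            # Let exactly one probe through after the cool-down.
            self._probe_in_flight = True
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        self._failures += 1
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            # Open, or re-open after a failed probe, and restart the cool-down.
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each provider call and route to the fallback chain when it returns False.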

Harbor Legal refactor: from hope to policy

Before the refactor, Harbor Legal's extraction service wrapped the SDK default (three retries, no jitter) and surfaced no distinction between overload and bad prompts. Attorneys re-uploaded contracts, doubling load.

After:

  1. Middleware classifies errors and tags traces with retry_class.
  2. 429/529 use backoff + jitter; 400 routes to a “fix upload” UI.
  3. Primary Claude Sonnet → fallback GPT-4o-mini for clause labels only (not legal advice text) → static template with manual review flag.
  4. Circuit opens after 5 failures in 30 s; queue consumers pause 45 s.
  5. Idempotency key on matter_id + sha256(file) prevents duplicate extractions.

Observability dashboards split “user-visible failure” from “recovered via fallback” so product can tune chain order without conflating metrics.

Technique decision table

| Approach | Best when | Weak when |
| --- | --- | --- |
| Simple retry (same model) | Rare blips, idempotent reads, low cost per call | Regional outages, 529 storms, agent tool loops |
| Model fallback chain | Capacity limits, need continuity of service | Strict quality/regulatory parity across tiers |
| Circuit breaker + shed | Protecting bankroll and provider relationship | Low traffic where false opens hurt availability |
| Queue + async retry | Batch extraction, email drafts, overnight jobs | Interactive chat needing sub-second feedback |
| Cached / template fallback | Read-heavy FAQ, status pages, known intents | Novel reasoning, personalized legal/medical advice |
| Human escalation queue | High-stakes failures after automated exhaustion | Scale-sensitive consumer chat |

Common pitfalls

  • Retrying 400s — burns quota and trains users to spam submit.
  • No jitter — synchronized retries extend provider outages.
  • Unbounded agent loops — tool + LLM retry compounds into dozens of calls per user message.
  • Fallback quality mismatch — mid-tier model hallucinates on tasks only frontier handles.
  • Ignoring Retry-After — fighting the provider's explicit guidance.
  • Double side effects — retried agent sends duplicate Slack alerts or charges.
  • Circuit never half-opens — stays open forever after brief blip unless probe logic exists.
  • Hidden degradation — users not told answer came from fallback tier on contractual workflows.

Production checklist

  • Classify provider errors into retryable vs fatal; document in runbooks.
  • Implement exponential backoff with full jitter; cap attempts and total wait.
  • Attach idempotency keys to billable operations; dedupe in persistence layer.
  • Define end-to-end deadline; subtract elapsed time before each retry attempt.
  • Configure fallback chain with capability checks (context, JSON mode, vision).
  • Trip circuit breaker on error-rate / latency SLO; auto half-open probe.
  • Emit metrics for retry rate, fallback rate, circuit state, and retry-attributed token cost.
  • Alert when fallback rate exceeds baseline (quality regression signal).
  • Test failover in staging by blocking primary provider hostname.
  • Document user-visible behavior when all automated tiers exhaust.

Key takeaways

  • 429 and 529 are routine — resilience policy matters more than uptime SLAs on paper.
  • Retry only transient errors with jitter; never loop on bad prompts or auth failures.
  • Idempotency keys and outboxes prevent duplicate completions and duplicate tool side effects.
  • Fallback chains trade quality for continuity — design tiers and disclose degradation where stakes are high.
  • Circuit breakers protect cost and provider relationships; pair with observability split by recovery path.
