Guide
LLM agent model fallback and graceful degradation explained
Harbor Support shipped a customer-service agent on a single flagship model because eval scores were highest in the lab. During a regional provider degradation in March, completion latency spiked from 2.1s to 38s p95 and 31% of runs returned hard 503 errors with no recovery path. Agents that had already called refund tools stalled mid-trajectory; users saw blank chat bubbles. First-contact resolution dropped 44% → 19% and CSAT on the bot channel fell twelve points in one week. The outage was not unique — the architecture was: one model string in config, no ladder, no degraded mode, no health gate.
Model fallback is the practice of routing agent work across an ordered ladder of models and providers when the primary is slow, rate-limited, or down. Graceful degradation is what you do when every model in the ladder is stressed: shrink tool sets, shorten context, switch to retrieval-only answers, or hand off to a human with preserved state. This guide covers routing ladders, health probes, capability matrices, degraded UX tiers, integration with rate limiting and middleware hooks, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
Why single-model agents fail in production
Demos assume the model endpoint is always warm. Production agents face provider maintenance, regional brownouts, TPM throttling, model deprecations, and sudden latency cliffs when a new model version rolls out behind the same API name. A chat UI can show a spinner; an agent loop with five tool calls in flight cannot — each failed completion wastes prior tool work and burns user trust.
Without fallback you get three failure modes:
- Hard fail — the run aborts; partial tool side effects may already be committed.
- Retry storm — the runtime retries the same overloaded endpoint, amplifying the outage (see backpressure).
- Silent quality collapse — the provider returns 200 but with 40s latency; users abandon before the agent finishes planning.
Fallback is not about always using the cheapest model. It is about matching capability to task criticality under stress while keeping the agent loop alive.
The model routing ladder
A routing ladder is an ordered list of model candidates per task class, not one global default. Harbor Support now defines separate ladders for planner, tool-arg filler, summarizer, and user-facing writer roles.
Ladder anatomy
Each rung specifies: provider, model ID, max latency budget, max cost per
1k tokens, minimum capability tags (e.g. json_mode,
function_calling, 128k_context), and a
warmup_probe interval. Example planner ladder:
primary → gpt-4.1 (json_mode, 8s budget)
fallback1 → gpt-4.1-mini (json_mode, 5s budget)
fallback2 → claude-sonnet (tool_use, 6s budget)
fallback3 → local-8b-planner (json_mode, 3s budget, retrieval-only tools)
The router tries primary until a trip condition fires: HTTP 5xx/429, latency > budget, empty tool call, or schema validation failure after one repair attempt. It then steps down, logging the hop in tracing spans so on-call can see which rung served traffic.
Capability matrix, not price alone
Downgrading from a flagship model to a mini model is fine for summarizing
tool JSON; it is dangerous for multi-step refund authorization where one
wrong tool argument creates a compliance ticket. Maintain a
capability matrix: rows are task classes, columns are
models, cells are allowed/forbidden/degraded-only. A planner step that
calls issue_refund may not use rungs below tier-2 without
human approval gate.
Cross-provider diversity
Ladders that only swap sizes within one provider still die when that provider has a regional incident. Include at least one rung on a different host for tier-1 flows. Normalize prompts and tool schemas so the second provider receives equivalent instructions — adapter code belongs in pre-model middleware, not copy-pasted per agent.
Health gates and circuit breakers
Reactive fallback (try primary, catch error, retry on backup) is too slow when p95 latency is already 30s. Health gates proactively skip unhealthy rungs:
- Synthetic probes — every 30s send a 50-token completion to each rung; track error rate and p95. Open the circuit when errors exceed 5% for two consecutive windows.
- Canary traffic — route 2% of live planner calls to fallback rungs continuously; compare tool-call parse success against primary.
- Latency SLO breach — if rolling p95 > 2×
baseline for five minutes, mark rung
DEGRADEDand prefer the next healthy rung for new runs (in-flight runs may still complete). - Quota signals — tie into rate limiter bucket depth; when TPM bucket is empty, jump to a rung on a different key or provider instead of queueing until users churn.
Circuits need half-open recovery: after a cooldown, send probe traffic before restoring primary share. Flapping circuits cause user-visible model personality shifts; hysteresis (stay on fallback for at least N minutes) stabilizes UX.
Graceful degradation tiers
When every rung is stressed, fallback exhausts. Degradation tiers define what the product does instead of failing closed:
Tier 0 — Full agent
All tools, full context, flagship or healthy primary model. Default steady state.
Tier 1 — Constrained agent
Disable non-essential tools (analytics lookups, upsell suggestions). Compress history via tool result summarization. Use smaller models for re-ranking only; keep tier-2 model for writes.
Tier 2 — Retrieval-only
No tool calls except search_kb. Answers cite help-center
chunks with a banner: “Some account actions are temporarily
unavailable.” Better than hallucinating refund status.
Tier 3 — Human handoff
Queue for live agent with exported run state: last user message, retrieved articles, partial tool results, model rung history. Pair with human-in-the-loop policies so handoff is not a black hole.
Tier 4 — Static fallback
Status page message and callback form. Use only when providers are globally down; still beats infinite spinner.
Product and legal must sign tier definitions before launch — especially which write tools are disabled at tier 1 vs tier 2.
Cost, quality, and observability under fallback
Fallback changes unit economics. Track per-rung spend in
cost attribution
ledgers: model_rung=primary|fallback1|... tags on every span.
Finance should see that failover day cost 1.4× normal but saved
revenue vs total outage.
Quality metrics must segment by rung: tool parse rate, schema validation pass rate, human escalation rate, CSAT. A fallback rung with 94% parse success may be acceptable for read-only flows and unacceptable for billing adjustments. Run weekly eval suites against each rung; demote rungs that fail golden trajectories.
Alert on rung mix shift: if >40% of traffic sits on fallback1 for an hour, that is an incident even if error rate is zero — you are paying latency and quality debt.
Harbor Support refactor walkthrough
Harbor rebuilt routing around a ModelRouter service with four components:
- LadderRegistry — versioned YAML per task class; capability matrix enforced at route time.
- HealthBoard — synthetic probes + live latency SLOs; circuits per provider/region.
- TripPolicy — unified rules for 5xx, 429, timeout, schema repair failure; max one automatic hop per step (prevents infinite ladder walks).
- DegradeOrchestrator — maps system stress score (open circuits + queue depth) to tier 0–3; UI banners driven by tier.
- RungAudit — every hop logged to immutable audit trail for compliance replay.
Replay of the March incident with the new router: hard failure rate 31% → 1.2%, p95 planner latency during stress 38s → 6.4s (fallback1 + tier-1 tool shedding), CSAT recovered to within two points of baseline within ten days. Refund mis-tooling on fallback rungs dropped to zero after the capability matrix blocked write tools below tier-2.
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Regional provider brownout | Cross-provider ladder + health circuits | Retry same endpoint indefinitely |
| TPM rate limit on primary | Jump to alternate key/provider rung | Blocking queue past UX tolerance |
| High-stakes write tools | Capability matrix; block low rungs | Blind downgrade to smallest model |
| Planner parse failures | One repair attempt then hop rung | Same model retry loop |
| Global provider outage | Tier 2 retrieval + tier 3 handoff | Hallucinated account actions |
| Cost spike during failover | Per-rung attribution + tier shedding | Silent budget overrun |
Common pitfalls
- One ladder for everything — summarization and refund planning need different rungs and budgets.
- Fallback without schema parity — backup model returns markdown instead of JSON; tool loop breaks worse than a timeout.
- No half-open recovery — circuits flap every minute; users see inconsistent tone and tool behavior.
- Hidden degradation — users do not know write tools are disabled; they think the bot is broken or lying.
- Ladder hops mid-trajectory without context trim — smaller context window on fallback rung truncates system prompt; inject compressed state package on hop.
- Eval only on primary — fallback rungs drift until an outage exposes 60% parse failure.
- Cross-provider prompt leakage — provider A system prompt tokens sent to provider B; policy violation and weird outputs.
Production checklist
- Task-class routing ladders with at least three rungs including cross-provider.
- Capability matrix gating write tools on lower rungs.
- Health probes, latency SLO circuits, and half-open recovery.
- Trip policy: 5xx, 429, timeout, schema failure with max one hop per step.
- Degradation tiers 0–3 documented with product/legal sign-off.
- User-visible banners when tier > 0.
- Per-rung cost and quality metrics in tracing and FinOps ledgers.
- Weekly golden evals executed on every rung in the ladder.
- Context trim or state package on rung hop for smaller windows.
- Integration with rate limiter bucket signals for proactive routing.
- Game-day drill: disable primary in staging and verify tier transitions.
Key takeaways
- Single-model agents fail loudly — provider incidents become total product outages without ladders.
- Route by task class and capability — price-only downgrade risks compliance and tool accuracy.
- Health gates beat reactive retry — proactive circuits preserve latency SLOs under brownouts.
- Degradation tiers are a product decision — retrieval-only and human handoff beat hallucinated writes.
- Harbor Support cut hard failures from 31% to 1.2% with ModelRouter ladders, capability matrix, and tiered degradation during provider stress.
Related reading
- Agent rate limiting and throttling explained — proactive routing when TPM buckets empty
- Middleware hook pipeline explained — normalize prompts across providers in one place
- Observability and tracing explained — rung-hop spans and SLO dashboards
- Human-in-the-loop explained — tier-3 handoff with preserved state