Guide

LLM agent model fallback and graceful degradation explained

Harbor Support shipped a customer-service agent on a single flagship model because eval scores were highest in the lab. During a regional provider degradation in March, completion latency spiked from 2.1s to 38s p95 and 31% of runs returned hard 503 errors with no recovery path. Agents that had already called refund tools stalled mid-trajectory; users saw blank chat bubbles. First-contact resolution dropped 44% → 19% and CSAT on the bot channel fell twelve points in one week. The outage was not unique — the architecture was: one model string in config, no ladder, no degraded mode, no health gate.

Model fallback is the practice of routing agent work across an ordered ladder of models and providers when the primary is slow, rate-limited, or down. Graceful degradation is what you do when every model in the ladder is stressed: shrink tool sets, shorten context, switch to retrieval-only answers, or hand off to a human with preserved state. This guide covers routing ladders, health probes, capability matrices, degraded UX tiers, integration with rate limiting and middleware hooks, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why single-model agents fail in production

Demos assume the model endpoint is always warm. Production agents face provider maintenance, regional brownouts, TPM throttling, model deprecations, and sudden latency cliffs when a new model version rolls out behind the same API name. A chat UI can show a spinner; an agent loop with five tool calls in flight cannot — each failed completion wastes prior tool work and burns user trust.

Without fallback you get three failure modes:

Hard fail — the run aborts; partial tool side effects may already be committed.
Retry storm — the runtime retries the same overloaded endpoint, amplifying the outage (see backpressure).
Silent quality collapse — the provider returns 200 but with 40s latency; users abandon before the agent finishes planning.

Fallback is not about always using the cheapest model. It is about matching capability to task criticality under stress while keeping the agent loop alive.

The model routing ladder

A routing ladder is an ordered list of model candidates per task class, not one global default. Harbor Support now defines separate ladders for planner, tool-arg filler, summarizer, and user-facing writer roles.

Ladder anatomy

Each rung specifies: provider, model ID, max latency budget, max cost per 1k tokens, minimum capability tags (e.g. json_mode, function_calling, 128k_context), and a warmup_probe interval. Example planner ladder:

primary   → gpt-4.1 (json_mode, 8s budget)
fallback1 → gpt-4.1-mini (json_mode, 5s budget)
fallback2 → claude-sonnet (tool_use, 6s budget)
fallback3 → local-8b-planner (json_mode, 3s budget, retrieval-only tools)

The router tries primary until a trip condition fires: HTTP 5xx/429, latency > budget, empty tool call, or schema validation failure after one repair attempt. It then steps down, logging the hop in tracing spans so on-call can see which rung served traffic.

Capability matrix, not price alone

Downgrading from a flagship model to a mini model is fine for summarizing tool JSON; it is dangerous for multi-step refund authorization where one wrong tool argument creates a compliance ticket. Maintain a capability matrix: rows are task classes, columns are models, cells are allowed/forbidden/degraded-only. A planner step that calls issue_refund may not use rungs below tier-2 without human approval gate.

Cross-provider diversity

Ladders that only swap sizes within one provider still die when that provider has a regional incident. Include at least one rung on a different host for tier-1 flows. Normalize prompts and tool schemas so the second provider receives equivalent instructions — adapter code belongs in pre-model middleware, not copy-pasted per agent.

Health gates and circuit breakers

Reactive fallback (try primary, catch error, retry on backup) is too slow when p95 latency is already 30s. Health gates proactively skip unhealthy rungs:

Synthetic probes — every 30s send a 50-token completion to each rung; track error rate and p95. Open the circuit when errors exceed 5% for two consecutive windows.
Canary traffic — route 2% of live planner calls to fallback rungs continuously; compare tool-call parse success against primary.
Latency SLO breach — if rolling p95 > 2× baseline for five minutes, mark rung DEGRADED and prefer the next healthy rung for new runs (in-flight runs may still complete).
Quota signals — tie into rate limiter bucket depth; when TPM bucket is empty, jump to a rung on a different key or provider instead of queueing until users churn.

Circuits need half-open recovery: after a cooldown, send probe traffic before restoring primary share. Flapping circuits cause user-visible model personality shifts; hysteresis (stay on fallback for at least N minutes) stabilizes UX.

Graceful degradation tiers

When every rung is stressed, fallback exhausts. Degradation tiers define what the product does instead of failing closed:

Tier 0 — Full agent

All tools, full context, flagship or healthy primary model. Default steady state.

Tier 1 — Constrained agent

Disable non-essential tools (analytics lookups, upsell suggestions). Compress history via tool result summarization. Use smaller models for re-ranking only; keep tier-2 model for writes.

Tier 2 — Retrieval-only

No tool calls except search_kb. Answers cite help-center chunks with a banner: “Some account actions are temporarily unavailable.” Better than hallucinating refund status.

Tier 3 — Human handoff

Queue for live agent with exported run state: last user message, retrieved articles, partial tool results, model rung history. Pair with human-in-the-loop policies so handoff is not a black hole.

Tier 4 — Static fallback

Status page message and callback form. Use only when providers are globally down; still beats infinite spinner.

Product and legal must sign tier definitions before launch — especially which write tools are disabled at tier 1 vs tier 2.

Cost, quality, and observability under fallback

Fallback changes unit economics. Track per-rung spend in cost attribution ledgers: model_rung=primary|fallback1|... tags on every span. Finance should see that failover day cost 1.4× normal but saved revenue vs total outage.

Quality metrics must segment by rung: tool parse rate, schema validation pass rate, human escalation rate, CSAT. A fallback rung with 94% parse success may be acceptable for read-only flows and unacceptable for billing adjustments. Run weekly eval suites against each rung; demote rungs that fail golden trajectories.

Alert on rung mix shift: if >40% of traffic sits on fallback1 for an hour, that is an incident even if error rate is zero — you are paying latency and quality debt.

Harbor Support refactor walkthrough

Harbor rebuilt routing around a ModelRouter service with four components:

LadderRegistry — versioned YAML per task class; capability matrix enforced at route time.
HealthBoard — synthetic probes + live latency SLOs; circuits per provider/region.
TripPolicy — unified rules for 5xx, 429, timeout, schema repair failure; max one automatic hop per step (prevents infinite ladder walks).
DegradeOrchestrator — maps system stress score (open circuits + queue depth) to tier 0–3; UI banners driven by tier.
RungAudit — every hop logged to immutable audit trail for compliance replay.

Replay of the March incident with the new router: hard failure rate 31% → 1.2%, p95 planner latency during stress 38s → 6.4s (fallback1 + tier-1 tool shedding), CSAT recovered to within two points of baseline within ten days. Refund mis-tooling on fallback rungs dropped to zero after the capability matrix blocked write tools below tier-2.

Technique decision table

Scenario	Prefer	Avoid
Regional provider brownout	Cross-provider ladder + health circuits	Retry same endpoint indefinitely
TPM rate limit on primary	Jump to alternate key/provider rung	Blocking queue past UX tolerance
High-stakes write tools	Capability matrix; block low rungs	Blind downgrade to smallest model
Planner parse failures	One repair attempt then hop rung	Same model retry loop
Global provider outage	Tier 2 retrieval + tier 3 handoff	Hallucinated account actions
Cost spike during failover	Per-rung attribution + tier shedding	Silent budget overrun

Common pitfalls

One ladder for everything — summarization and refund planning need different rungs and budgets.
Fallback without schema parity — backup model returns markdown instead of JSON; tool loop breaks worse than a timeout.
No half-open recovery — circuits flap every minute; users see inconsistent tone and tool behavior.
Hidden degradation — users do not know write tools are disabled; they think the bot is broken or lying.
Ladder hops mid-trajectory without context trim — smaller context window on fallback rung truncates system prompt; inject compressed state package on hop.
Eval only on primary — fallback rungs drift until an outage exposes 60% parse failure.
Cross-provider prompt leakage — provider A system prompt tokens sent to provider B; policy violation and weird outputs.

Production checklist

Task-class routing ladders with at least three rungs including cross-provider.
Capability matrix gating write tools on lower rungs.
Health probes, latency SLO circuits, and half-open recovery.
Trip policy: 5xx, 429, timeout, schema failure with max one hop per step.
Degradation tiers 0–3 documented with product/legal sign-off.
User-visible banners when tier > 0.
Per-rung cost and quality metrics in tracing and FinOps ledgers.
Weekly golden evals executed on every rung in the ladder.
Context trim or state package on rung hop for smaller windows.
Integration with rate limiter bucket signals for proactive routing.
Game-day drill: disable primary in staging and verify tier transitions.

Key takeaways

Single-model agents fail loudly — provider incidents become total product outages without ladders.
Route by task class and capability — price-only downgrade risks compliance and tool accuracy.
Health gates beat reactive retry — proactive circuits preserve latency SLOs under brownouts.
Degradation tiers are a product decision — retrieval-only and human handoff beat hallucinated writes.
Harbor Support cut hard failures from 31% to 1.2% with ModelRouter ladders, capability matrix, and tiered degradation during provider stress.