
Graceful degradation explained

Every production stack depends on services you do not control: payment gateways, recommendation engines, identity providers, third-party maps. When one of them slows or fails, the worst response is a blank 500 page. Graceful degradation keeps your critical (Tier 0) path working by deliberately dropping or simplifying non-critical features instead of failing the whole request. This guide covers degradation tiers, fallback patterns (stale cache, static defaults, read-only mode), load shedding versus degradation, critical-path timeouts, circuit breakers and feature flags, honest user messaging, a Harbor Commerce product catalog worked example, a strategy decision table, common pitfalls, and a production checklist. It complements our API rate limiting guide and load balancing explainer.

Graceful degradation vs fail-fast

Fail-fast rejects work immediately when a dependency is unhealthy — correct for hard requirements (you cannot charge a card without the payment processor). Graceful degradation accepts partial success: the checkout completes even if the “customers also bought” widget is empty. The design question is always: what is the minimum viable response that still delivers user value?

Degradation is not the same as ignoring errors. You still log, alert, and measure fallback usage. The difference is user-facing: they get a slower, simpler, or slightly stale experience instead of an error boundary. Teams that confuse the two either hide outages (no alerts when fallbacks dominate) or over-degrade (serving wrong prices because “something is better than nothing”).

Degradation tiers

Map every feature to a tier before an incident forces the decision:

Tier 0 (critical) — must succeed or the operation aborts with a clear error. Examples: authentication, payment capture, inventory reservation, write to primary database.
Tier 1 (enhanced) — strongly expected but substitutable. Examples: personalized recommendations, real-time inventory counts, live shipping quotes. Fallback: generic list, cached count, flat-rate estimate.
Tier 2 (delight) — nice-to-have. Examples: social proof badges, animated confetti, A/B experiment assignments. Fallback: omit silently or show a static placeholder.

Document tiers in runbooks and encode them in code paths, not tribal knowledge. During a Sev-1, nobody should debate whether search suggestions are Tier 1 or Tier 2.
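One way to encode tiers in a code path is a small registry that every handler consults on failure. The sketch below is illustrative only: the feature names and the on_failure helper are hypothetical, and the tier assignments simply mirror the examples listed above.

```python
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 0   # must succeed or the operation aborts
    ENHANCED = 1   # substitutable with a fallback
    DELIGHT = 2    # may be omitted silently


# Hypothetical feature names; assignments follow the tier examples in this guide.
FEATURE_TIERS = {
    "authentication": Tier.CRITICAL,
    "payment_capture": Tier.CRITICAL,
    "live_shipping_quote": Tier.ENHANCED,
    "personalized_recommendations": Tier.ENHANCED,
    "social_proof_badges": Tier.DELIGHT,
}


def on_failure(feature: str) -> str:
    """Return the action a request handler should take when a feature fails."""
    tier = FEATURE_TIERS[feature]  # KeyError for unclassified features is deliberate
    if tier is Tier.CRITICAL:
        return "abort"
    if tier is Tier.ENHANCED:
        return "fallback"
    return "omit"
```

Raising on an unclassified feature forces the tier decision at development time rather than during an incident.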

Fallback patterns that work in production

A fallback is only as good as its freshness guarantees and failure detection. Pick patterns that match how wrong stale data can be.

Stale-cache fallback

Serve the last known good value from Redis, CDN, or local memory when the origin times out. Tag responses with Cache-Control or an internal degraded: true header so downstream systems know. Set a maximum staleness age: product catalog might tolerate 15 minutes; stock levels might tolerate 30 seconds. Beyond that, switch to a safer fallback (hide the widget, show “availability unknown”).
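A minimal sketch of a bounded stale-cache fallback follows. It assumes the caller already holds the last cached entry (from Redis or local memory); CacheEntry and fetch_with_stale_fallback are hypothetical names, and the clock is injectable so the staleness check can be tested.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CacheEntry:
    value: Any
    stored_at: float  # seconds, on the same clock as `now`


def fetch_with_stale_fallback(
    fetch_origin: Callable[[], Any],
    cached: Optional[CacheEntry],
    max_stale_seconds: float,
    now: Optional[float] = None,
) -> dict:
    """Try the origin; on failure serve the cached value only if it is fresh enough."""
    now = time.monotonic() if now is None else now
    try:
        return {"value": fetch_origin(), "degraded": False}
    except Exception:
        if cached is not None and now - cached.stored_at <= max_stale_seconds:
            return {"value": cached.value, "degraded": True}
        # Too stale or nothing cached: report "unknown" rather than guess.
        return {"value": None, "degraded": True}
```

The degraded field plays the role of the internal degraded: true marker, and a None value is the cue to hide the widget or show "availability unknown".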

Static and default fallbacks

Precomputed JSON files, hard-coded defaults, or generic content require no live dependency. A news site might show editor-picked headlines when the personalization API is down. Defaults must be safe: never show a $0 price or grant admin permissions because the auth enrichment service failed.

Read-only and queue-for-later modes

When writes are risky (replica lag, split brain), flip to read-only: users browse and search but cannot checkout until recovery. Alternatively, accept writes into an outbox queue and acknowledge with “order received, confirmation email pending” — only if reconciliation is automated and idempotent.
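The queue-for-later idea can be sketched as an outbox keyed by an idempotency key. This is an in-memory stand-in under stated assumptions: a real outbox would persist entries durably, and the Outbox class and its method names are hypothetical.

```python
from collections import deque


class Outbox:
    """In-memory stand-in for a durable outbox; a real one would persist entries."""

    def __init__(self):
        self._pending = deque()
        self._seen_keys = set()

    def accept(self, idempotency_key: str, order: dict) -> str:
        """Queue a write once per key and return the status shown to the user."""
        if idempotency_key not in self._seen_keys:
            self._seen_keys.add(idempotency_key)
            self._pending.append((idempotency_key, order))
        return "order received, confirmation email pending"

    def drain(self, apply_write) -> int:
        """Replay queued writes after recovery; returns how many were applied."""
        applied = 0
        while self._pending:
            key, order = self._pending[0]
            apply_write(key, order)   # must itself be idempotent
            self._pending.popleft()   # remove only after the write succeeded
            applied += 1
        return applied
```

Because an entry is removed only after apply_write returns, a crash mid-drain can replay the same entry, which is why the write itself has to be idempotent.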

Reduced fidelity responses

Return fewer fields, lower-resolution images, or a simplified HTML shell. Mobile apps can ship bundled assets for offline-first Tier 2 features. The key is predictable shrinkage: users understand “recommendations temporarily unavailable” more than a half-rendered page with broken layout.

Load shedding vs degradation

Load shedding rejects or delays incoming work to protect surviving capacity — HTTP 503 with Retry-After, token-bucket per client, or dropping a percentage of requests at the load balancer. Shedding protects the system; degradation changes what each accepted request returns. Use both: shed traffic before overload causes cascading timeouts, then degrade features for requests you keep.

Priority queues help: process authenticated checkout before anonymous browse, health checks before analytics beacons. Pair shedding with rate limits so abusive clients are dropped first. Autoscaling adds capacity but lags spikes; shedding and degradation bridge the gap until replicas catch up.
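Priority-aware shedding can be sketched as an admission check that runs before any work starts. The priority classes, utilization thresholds, and the 5-second Retry-After value below are illustrative assumptions, not tuned recommendations.

```python
# Hypothetical priority classes; a lower number means more important.
PRIORITY = {"health_check": 0, "checkout": 1, "browse": 2, "analytics": 3}


def admit(request_class: str, in_flight: int, capacity: int) -> tuple[int, dict]:
    """Return (status, headers). Low-priority classes are shed first as load rises."""
    utilization = in_flight / capacity
    # Thresholds are illustrative assumptions.
    if utilization >= 1.0:
        max_priority = 0
    elif utilization >= 0.9:
        max_priority = 1
    elif utilization >= 0.75:
        max_priority = 2
    else:
        max_priority = 3
    if PRIORITY[request_class] <= max_priority:
        return 200, {}
    return 503, {"Retry-After": "5"}
```

Requests that pass admission can still be degraded afterwards; shedding only decides whether a request is accepted at all.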

Critical-path timeouts

Optional dependencies must not block Tier 0. Wrap each Tier 1 call in a tight deadline (often 50–200 ms for page render, longer for batch). On timeout, take the fallback immediately rather than waiting on a hung upstream connection. Use bulkheads: separate thread pools or connection limits so one slow recommendation service cannot exhaust the pool used for database reads.
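A deadline plus a bulkhead can be sketched with the standard library's concurrent.futures. Pool sizes, the call_with_deadline helper, and the simulated slow upstream are assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Separate pools act as bulkheads: a slow optional call cannot use up
# the workers reserved for Tier 0. Pool sizes are illustrative.
TIER0_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="tier0")
OPTIONAL_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="optional")


def call_with_deadline(fn, deadline_seconds, fallback):
    """Run an optional call on its own pool; return fallback if it is late or fails."""
    future = OPTIONAL_POOL.submit(fn)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeout:
        future.cancel()  # only stops a task that has not started yet
        return fallback
    except Exception:
        return fallback


def slow_recommendations():
    time.sleep(0.3)  # simulates a hung upstream
    return ["sku-1", "sku-2"]
```

Note that a timed-out worker keeps running until the upstream call returns, so the client used inside fn should carry its own socket timeout; otherwise the optional pool itself can fill up.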

Circuit breakers, feature flags, and UX messaging

A circuit breaker stops calling a failing dependency after error thresholds trip, failing fast locally instead of waiting on timeouts. Breakers and degradation are complementary: when the breaker opens, your code path switches to the stale-cache fallback without hammering the sick service. When the breaker half-opens, send a trickle of probe traffic before full restoration.
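The breaker states can be sketched as a small class with an injectable clock. This is a simplified, single-threaded illustration with hypothetical names: a production breaker would also cap how many probes run concurrently in the half-open state and would usually count failures over a time window.

```python
class CircuitBreaker:
    """Minimal breaker: closed, open after N failures, half-open after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means closed

    def state(self, now):
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.cooldown_seconds:
            return "half_open"
        return "open"

    def call(self, fn, fallback, now):
        """Invoke fn unless the breaker is open; fallback() supplies the degraded value."""
        if self.state(now) == "open":
            return fallback()         # fail fast, no call to the sick service
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # trip, or re-open after a failed probe
            return fallback()
        self.failures = 0
        self.opened_at = None         # a success closes the breaker
        return result
```

Passing the fallback as a callable keeps the stale-cache lookup lazy, so it only runs when the live call is skipped or fails.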

Feature flags let operators disable Tier 2 features globally during incidents without redeploying. Flags differ from automatic degradation: flags are human-driven kill switches; degradation is code-defined per request. Combine them: a flag forces recommendations off; automatic degradation handles the case where the service is slow but not yet flagged.
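Combining the two mechanisms can look like the sketch below: the flag is checked first, and automatic degradation covers the unflagged case. The flags dictionary and the recommendations_for helper are hypothetical; the flag name disable_recommendations is the one used in the worked example later in this guide.

```python
def recommendations_for(user_id, flags, fetch_recommendations):
    """Check the operator flag first; degrade automatically if the call fails."""
    if flags.get("disable_recommendations", False):
        return []  # operator kill switch: skip the call entirely
    try:
        return fetch_recommendations(user_id)
    except Exception:
        return []  # automatic per-request degradation
```

Checking the flag before the call means a flagged-off feature costs no upstream request and no pool thread.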

What to tell users

Honest, brief messaging builds trust. Tier 2 omissions need no banner. Tier 1 gaps deserve inline copy: “Shipping estimates may be delayed.” Avoid alarming language for partial degradation; reserve site-wide banners for Tier 0 impact (checkout unavailable). Never claim “everything is fine” when fallbacks serve materially wrong data.

Worked example: Harbor Commerce product catalog

Harbor Commerce serves product detail pages from a core catalog service (Tier 0), a real-time inventory microservice (Tier 1), and a personalization API for “You may also like” (Tier 2). During an inventory service outage, the page must still render title, price, and images from catalog.

Request enters the backend-for-frontend (BFF) with a 300 ms total budget for optional calls.
Catalog fetch (Tier 0) gets 250 ms and has no fallback: a missing product returns 404, and a catalog outage returns 503.
Inventory call wrapped in 80 ms timeout with circuit breaker. On failure: serve in_stock: null and UI copy “Check availability at checkout”; log metric inventory_fallback_total.
Recommendations get 50 ms; on failure omit the carousel entirely (Tier 2 silent drop).
Stale cache for inventory: if breaker is open, skip live call and use Redis value if younger than 60 s; otherwise null stock state.
Incident flag disable_recommendations set by on-call skips even the timeout attempt, saving pool threads during peak.
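The steps above can be sketched as a single page-assembly function. Everything here is a hypothetical illustration: the client callables are assumed to raise on timeout or failure with their deadlines (250, 80, and 50 ms) enforced internally, the breaker state is passed in as a boolean, and cached_stock maps a product id to a (value, age_seconds) pair standing in for Redis.

```python
def stock_state(product_id, inventory, breaker_open, cached_stock, metrics):
    """Return live stock, a cached value under 60 s old, or None (unknown)."""
    if not breaker_open:
        try:
            return inventory(product_id)      # 80 ms deadline assumed inside
        except Exception:
            pass
    metrics["inventory_fallback_total"] = metrics.get("inventory_fallback_total", 0) + 1
    if breaker_open:
        stale = cached_stock.get(product_id)  # (value, age_seconds) or None
        if stale is not None and stale[1] < 60:
            return stale[0]
    return None


def render_product_page(product_id, catalog, inventory, recommendations,
                        breaker_open, cached_stock, flags, metrics):
    """Assemble the page dict; only a catalog failure is allowed to propagate."""
    page = dict(catalog(product_id))          # Tier 0: errors reach the caller
    page["in_stock"] = stock_state(
        product_id, inventory, breaker_open, cached_stock, metrics)
    if page["in_stock"] is None:
        page["stock_copy"] = "Check availability at checkout"
    if not flags.get("disable_recommendations", False):
        try:
            page["carousel"] = recommendations(product_id)  # 50 ms deadline assumed
        except Exception:
            pass                              # Tier 2: omit the carousel entirely
    return page
```

The in_stock check uses "is None" on purpose, so a genuine out-of-stock value of False is never confused with an unknown state.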

Post-incident, dashboards compare p95 latency, fallback rate, and conversion during degradation vs baseline. If checkout conversion drops more than 2% when inventory is null, consider widening the stale-cache window so fewer requests see a null stock state, or add a conservative “limited stock” heuristic.

Strategy decision table

Situation	Preferred approach	Avoid
Optional enrichment API slow	Timeout + stale cache or omit	Blocking the whole page
Payment processor down	Fail-fast with retry guidance	Fake “order placed” without capture
Database replica lagging	Read-only mode or route reads to primary	Serving writes that contradict reads
Traffic spike exceeds capacity	Load shed + drop Tier 2	Unbounded queue growth
Third-party map tiles failing	Static map image or address text	Broken map iframe stealing focus
Search index stale	Degrade to DB prefix search	Empty results with no explanation
ML ranking model unavailable	Fallback to popularity sort	Random order presented as personalized
Global CDN miss storm	Shed anonymous traffic; serve static shell	Retry storm to origin

Common pitfalls

Fallbacks that lie — showing in-stock when unknown causes support churn and overselling.
No staleness bounds — serving week-old prices erodes trust worse than a brief outage.
Hidden degradation — if 90% of traffic hits fallbacks, on-call must page; metrics are mandatory.
Shared pools — optional calls starving Tier 0 threads; use bulkheads.
Cascading retries — clients retry degraded 503s and amplify load; use jitter and caps.
Testing only happy path — chaos drills and fault injection reveal missing fallbacks.
Degrading security — skipping fraud checks or auth validation “temporarily” is never acceptable.
Identical timeouts everywhere — one global 30 s timeout masks which dependency is slow.
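For the cascading-retries pitfall, "jitter and caps" can be sketched as capped exponential backoff with full jitter. The function name, defaults, and injectable random generator below are assumptions for illustration.

```python
import random


def backoff_delays(max_attempts=4, base_seconds=0.2, cap_seconds=5.0, rng=None):
    """Full-jitter exponential backoff: one capped random delay per retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts - 1):  # no delay before the first try
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

The attempt cap bounds the total load one client can add, and the jitter keeps clients that failed together from retrying together.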

Production checklist

Classify every feature as Tier 0, 1, or 2; document in service README and runbook.
Implement per-dependency timeouts shorter than the user-facing SLA.
Define fallback behavior in code for each Tier 1 dependency; test in CI.
Emit metrics: fallback_used, breaker state, staleness age.
Alert when fallback rate exceeds baseline (e.g. 5% for 5 minutes).
Pair breakers with fallbacks; verify half-open recovery does not stampede.
Run game days: kill dependency in staging, validate UX and checkout.
Review user-facing copy for Tier 1 gaps; localize if applicable.
Ensure load shedding triggers before thread exhaustion.
Post-incident: compare conversion and error budgets during degradation window.
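The alerting item in the checklist can be made concrete with a sustained-threshold check. This sketch assumes per-minute (fallback_count, total_count) pairs are already available from your metrics store; should_alert is a hypothetical helper, and the 5% for 5 minutes defaults come from the example in the checklist.

```python
def should_alert(per_minute_counts, threshold=0.05, window_minutes=5):
    """per_minute_counts: list of (fallback_count, total_count), oldest first.

    Alert only if every one of the last `window_minutes` minutes had traffic
    and a fallback rate above the threshold.
    """
    recent = per_minute_counts[-window_minutes:]
    if len(recent) < window_minutes:
        return False
    return all(total > 0 and fallbacks / total > threshold
               for fallbacks, total in recent)
```

Requiring every minute in the window to breach the threshold avoids paging on a single noisy minute, at the cost of a slower first alert.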

Key takeaways

Tier 0 must succeed or fail clearly; Tier 1 and 2 shrink gracefully.
Timeouts and bulkheads keep optional dependencies off the critical path.
Stale cache and static defaults are the workhorses of degradation — with explicit max age.
Load shedding protects capacity; degradation changes response quality.
Circuit breakers + metrics + honest UX turn partial outages into survivable incidents.