
SLOs and error budgets explained

Teams that chase “100% uptime” usually burn out on noisy alerts while users still complain. Service level objectives (SLOs) flip the script: define reliability as a product choice with a measurable budget, not a moral imperative. An SLI (service level indicator) is the metric you actually measure — successful requests, latency under a threshold, job freshness. An SLO is the target over a window (e.g. 99.9% of requests succeed each month). The gap between perfect and the target is your error budget: the amount of failure you can afford before user trust erodes. That budget drives release policy — ship boldly while budget remains, slow down when it burns. This guide covers SLI selection, SLO target math, burn-rate alerting, error-budget policies, and how SLOs connect to observability, chaos testing, and safe deployments.

SLI, SLO, and SLA — three different contracts

These acronyms get conflated in job posts and vendor decks. Keep them separate:

SLI (indicator) — a quantitative measure of service behavior from the user’s perspective. Example: “fraction of HTTP requests that return 2xx/3xx within 500 ms, excluding client errors.”
SLO (objective) — an internal target for that SLI over a rolling window. Example: “99.9% success over 30 days.” SLOs are what engineering owns and optimizes.
SLA (agreement) — a contractual promise to a customer, often with financial penalties. SLAs should be looser than SLOs so you have buffer before breaching a contract.

If your SLA says 99.5% monthly availability but your internal SLO is 99.95%, you have roughly 0.45 percentage points of headroom — time to fix incidents before credits kick in. Never set SLO = SLA unless you enjoy surprise invoices.

Choosing SLIs that reflect user pain

A bad SLI measures something easy (CPU under 80%) instead of something users feel (checkout completes). Good SLIs are valid (correlate with user experience), measurable (cheap to collect continuously), and actionable (teams know how to improve them).

Common SLI families:

Availability — proportion of successful events. For APIs: good_requests / valid_requests. Exclude 4xx caused by bad client input unless your product treats them as failures.
Latency — proportion of requests faster than a threshold. Use histograms, not averages: “99% of search queries < 300 ms” beats “mean latency 120 ms” when a long tail ruins UX.
Throughput / freshness — for batch pipelines and data products: “95% of daily aggregates land before 06:00 UTC” or “99% of webhooks deliver within 60 seconds.”
Correctness — harder but critical for payments and ledgers: reconciliation error rate, duplicate charge rate, or divergence from a golden source.
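As a rough illustration of the availability and latency families above, the sketch below computes both SLIs from a list of request records. The `Request` type, the field names, and the 300 ms default are hypothetical; the choice to exclude every 4xx response is an assumption that your product may not share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def is_valid(req: Request) -> bool:
    # Assumption: 4xx responses are client mistakes and are left out of the SLI.
    return not (400 <= req.status < 500)


def availability_sli(requests: list[Request]) -> float:
    """good_requests / valid_requests, where good means a 2xx or 3xx response."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        # Simplification: an empty window counts as compliant. Low-traffic
        # paths may need synthetic probes so this does not hide an outage.
        return 1.0
    good = sum(1 for r in valid if 200 <= r.status < 400)
    return good / len(valid)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Proportion of valid requests that finished faster than the threshold."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0
    fast = sum(1 for r in valid if r.latency_ms < threshold_ms)
    return fast / len(valid)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(404, 30.0),   # excluded: client error
    Request(503, 900.0),
    Request(302, 80.0),
]
print(availability_sli(sample))  # 3 good out of 4 valid requests
print(latency_sli(sample))       # 2 fast out of 4 valid requests
```

Both functions share the same `is_valid` filter, so the two SLIs agree on the denominator.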

Start with one SLI per user journey slice — login, read, write, payment — not one global number for a monolith that hides which path broke. Instrument at the edge (load balancer, API gateway) when possible so client geography and CDN behavior appear in the signal.

SLO targets and the math of nines

“Four nines” (99.99%) sounds impressive until you translate it to allowed downtime. Over a 30-day month:

99% — ~7.2 hours of bad events
99.9% — ~43 minutes
99.95% — ~22 minutes
99.99% — ~4.3 minutes
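The figures above follow from simple arithmetic: the allowed failure is (1 - target) multiplied by the length of the window. A minimal sketch, with a hypothetical helper name, that reproduces them:

```python
def allowed_bad_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total failure permitted by an SLO target over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - slo_target) * window_minutes


for target in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%}: {allowed_bad_minutes(target):.1f} minutes")
```

This treats the budget as time; for request-based SLIs the same fraction applies to the count of valid requests instead of minutes.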

Each additional nine cuts the allowed failure tenfold and usually costs disproportionately more engineering time. An internal admin dashboard might live at 99.5%; a payment authorization path might need 99.95% or better. Pick targets by asking product and support: “At what failure rate do users churn or tickets spike?” Historical incident data beats aspirational posters.

Use rolling windows (28 or 30 days) for error budgets so a bad week does not reset arbitrarily on calendar month boundaries — though calendar months align nicely with SLA reporting. Some teams maintain both a long window for budget and a short window (7 days) for tactical alerts.
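One way to picture a rolling window is a fixed-length queue of daily counts, where each new day pushes the oldest one out. The class below is a sketch under that assumption; the name `RollingSli` and the daily granularity are illustrative, and a real system would more likely express this as a range query in its metrics store.

```python
from collections import deque


class RollingSli:
    """Tracks good/total event counts over the most recent `window_days` days."""

    def __init__(self, window_days: int = 28) -> None:
        # A bounded deque drops the oldest entry once it is full.
        self._days = deque(maxlen=window_days)

    def add_day(self, good: int, total: int) -> None:
        self._days.append((good, total))

    def sli(self) -> float:
        good = sum(g for g, _ in self._days)
        total = sum(t for _, t in self._days)
        return good / total if total else 1.0


# A 3-day window keeps the example short.
tracker = RollingSli(window_days=3)
for good, total in [(100, 100), (90, 100), (100, 100), (100, 100)]:
    tracker.add_day(good, total)
print(tracker.sli())  # the bad day is still inside the window

tracker.add_day(100, 100)
print(tracker.sli())  # the bad day has now aged out
```

Unlike a calendar month, the bad day leaves the window exactly `window_days` after it happened, regardless of where the month boundary falls.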

Error budgets: reliability as a finite resource

If your SLO is 99.9% availability over 30 days, your error budget is 0.1% — roughly 43 minutes of failure budget per month for that SLI. Every incident, deploy regression, or slow burn consumes budget. When budget remains plentiful, teams can prioritize features and experiments. When budget nears zero, policy shifts toward stability work: freeze risky releases, pay down tech debt, expand test coverage, run game days.
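For a request-based SLI, the budget can be tracked as a fraction of allowed bad events rather than minutes. A minimal sketch, assuming you already have good and total event counts for the window (the function name is hypothetical):

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# 1,000,000 requests at 99.9% allow about 1,000 failures; 250 have occurred.
print(round(budget_remaining(999_750, 1_000_000, 0.999), 3))
```

A negative result means the SLO has already been missed for the window, which is usually the point where the policy described below takes over.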

Error budgets make trade-offs explicit in product meetings, so reliability debates do not happen only after outages. Example policy:

> 50% budget remaining — normal release cadence; canary and feature flags encouraged.
25–50% remaining — increase scrutiny on deploys; require rollback plans and on-call acknowledgment.
< 25% remaining — freeze non-critical releases; focus on fixes, capacity, and incident follow-ups.
Budget exhausted — halt feature work affecting the SLO scope until budget recovers; executive visibility.
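The tiers above can be encoded as a small lookup that a dashboard or deploy pipeline could call. This is only a sketch: the enum names are invented for illustration, and the choice to treat exactly 25% and exactly 50% as the middle tier is an assumption, since the policy does not say which side the boundaries fall on.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "normal release cadence"
    SCRUTINY = "increased deploy scrutiny"
    FREEZE = "freeze non-critical releases"
    HALT = "halt feature work in SLO scope"


def release_policy(budget_remaining: float) -> ReleasePolicy:
    """Map the unspent budget fraction (1.0 = untouched) to a policy tier."""
    if budget_remaining <= 0.0:
        return ReleasePolicy.HALT
    if budget_remaining < 0.25:
        return ReleasePolicy.FREEZE
    if budget_remaining <= 0.50:
        return ReleasePolicy.SCRUTINY
    return ReleasePolicy.NORMAL


for remaining in (0.80, 0.40, 0.10, -0.05):
    print(f"{remaining:+.0%} -> {release_policy(remaining).value}")
```

Keeping the thresholds in one function makes it easier to review policy changes the same way as code changes.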

Budgets are team agreements, not automatic kill switches, but without teeth they become slide-deck fiction. Tie them to real gates in CI/CD: a manual approval step when budget is low, mandatory postmortems when budget burns fast.

Burn-rate alerting: catch SLO breaches early

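Waiting until the monthly error budget hits zero means you learned too late. Google’s Site Reliability Workbook popularized multi-window, multi-burn-rate alerts: compute the burn rate (the observed error rate divided by the error rate the SLO allows) over several lookback windows, and alert when it stays above a threshold that would exhaust the budget well before the SLO window ends.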

Example: a 99.9% monthly SLO allows 0.1% errors. If you observe 1% errors for one hour, you are burning budget 10× faster than sustainable; at that pace the month’s budget disappears in 3 days, not 30. Alert on:

Fast burn — high multiplier (e.g. 14×) over a short window (1 hour) → page on-call immediately.
Slow burn — moderate multiplier (e.g. 3×) over longer windows (6 h, 3 days) → ticket or chat alert for investigation.
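Burn rate is a ratio, so it is easy to sketch in a few lines. The helper names below are hypothetical, and the rule that a page requires both a long and a short window to exceed the threshold is one common way to apply the multi-window idea, assumed here rather than taken from the text above.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if this burn rate holds steady."""
    if rate <= 0.0:
        return float("inf")
    return window_days * 24 / rate


def should_page(long_err: float, short_err: float, slo_target: float,
                threshold: float = 14.0) -> bool:
    # Assumption: the long window (e.g. 1 hour) shows the burn is significant,
    # and the short window (e.g. 5 minutes) shows it is still happening.
    return (burn_rate(long_err, slo_target) >= threshold
            and burn_rate(short_err, slo_target) >= threshold)


rate = burn_rate(error_rate=0.01, slo_target=0.999)
print(round(rate, 2), round(hours_to_exhaustion(rate), 1))
print(should_page(long_err=0.02, short_err=0.02, slo_target=0.999))
print(should_page(long_err=0.02, short_err=0.001, slo_target=0.999))
```

Requiring the short window as well means the alert clears soon after the error rate recovers, instead of staying red until the long window drains.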

Burn-rate alerts reduce noise compared to static thresholds (“error rate > 1%”) because they incorporate SLO context. Pair them with RED metrics (rate, errors, duration) on the same SLI definition so dashboards and pages show one consistent story.

Implementing SLOs in production

Practical stack patterns:

Request logs or metrics — emit status, latency_ms, and route labels; classify good vs bad at query time with recording rules.
Histograms for latency SLOs — Prometheus histograms or OpenTelemetry exponential histograms; compute percentile compliance without storing every span.
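SLO-as-code tooling — Sloth generates Prometheus recording and alerting rules from YAML SLO specs, OpenSLO defines a vendor-neutral YAML specification, and Google Cloud Monitoring exposes SLO objects with burn-rate alerting; each lets you codify SLI queries instead of hand-writing alerts.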
Error budget dashboards — single pane showing budget remaining, burn rate, and recent deploy markers; annotate with incident IDs.
Synthetic probes — supplement real traffic SLIs for low-traffic paths (admin APIs, webhooks) so blind spots do not read as 100% success.
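To show why histograms are enough for a latency SLO, the sketch below reads compliance straight from cumulative bucket counts, the layout Prometheus uses for its `le` buckets. The dictionary shape, function name, and sample numbers are assumptions for illustration; the key constraint is that the SLO threshold has to coincide with a bucket boundary.

```python
import math


def latency_compliance(cumulative_buckets: dict[float, int],
                       threshold_s: float) -> float:
    """Fraction of observations at or below the threshold.

    `cumulative_buckets` maps each upper bound (in seconds) to the count of
    observations less than or equal to it; math.inf holds the total count.
    """
    if math.inf not in cumulative_buckets:
        raise ValueError("buckets must include a math.inf (total) entry")
    if threshold_s not in cumulative_buckets:
        raise ValueError("threshold must match a bucket boundary")
    total = cumulative_buckets[math.inf]
    if total == 0:
        return 1.0
    return cumulative_buckets[threshold_s] / total


# Hypothetical counts for a search endpoint.
buckets = {0.1: 700, 0.3: 990, 1.0: 998, math.inf: 1000}
print(latency_compliance(buckets, 0.3))
```

If the threshold falls between two boundaries, the result can only be estimated, so it is worth adding a bucket at the SLO threshold when the histogram is first defined.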

Define SLIs in version-controlled config reviewed like application code. When a deploy changes what counts as “good,” the SLI definition should change in the same pull request — otherwise you optimize a moving target.

SLOs vs monitoring vs alerting

Monitoring watches known failure modes (disk full, certificate expiry). Alerting should primarily track SLO burn and user-visible symptoms — not every CPU twitch. Infrastructure alerts belong on dashboards for diagnosis, not necessarily pages, unless they predict imminent SLO violation.

During incidents, the question is not “Is Redis CPU high?” but “Are we consuming error budget for checkout?” Post-incident reviews should quantify budget burned and link to action items. Over time, recurring budget drains on the same SLI justify architectural investment — caching, sharding, circuit breakers — with product-visible ROI.

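Validate SLOs with controlled failure via chaos experiments: if killing one pod does not move your availability SLI, either the service tolerates that failure as designed or the experiment did not hit the critical path. If users noticed the failure and the SLI still stayed flat, the SLI is measuring the wrong thing.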

Common anti-patterns

SLI measures the server, not the user — green internal health checks while CDN or DNS fails for customers.
Too many SLOs — dozens of targets nobody owns; start with 3–5 per service tied to journeys.
SLO tighter than necessary — 99.99% on a best-effort analytics export wastes engineering effort that could fund features users pay for.
No error budget policy — SLO dashboards as wallpaper with no release consequences.
Excluding all 4xx/5xx arbitrarily — hiding real outages by reclassifying errors; be honest about what users experience.
Calendar-only thinking — forgetting that budgets reset at month boundaries while user trust does not, and that a single 5-minute outage already exceeds an entire 99.99% monthly budget (~4.3 minutes).

Production checklist

Identify top user journeys and pick one SLI each (availability, latency, or freshness).
Set SLO targets from user/support data, not competitor marketing.
Keep SLAs looser than internal SLOs with documented buffer.
Compute rolling error budget and publish a team-visible dashboard.
Configure multi-window burn-rate alerts; tune to reduce false pages.
Write an error budget policy (release freeze thresholds, escalation paths).
Mark deploys and incidents on SLO graphs; review budget in weekly ops meetings.
Revisit SLOs quarterly — products and traffic patterns change.

Key takeaways

SLIs measure user-visible behavior; SLOs set internal targets; SLAs are customer contracts.
Error budgets turn reliability into a finite resource that guides feature vs stability trade-offs.
Use burn-rate alerts to catch budget consumption early, not just monthly summaries.
Alert on SLO violation risk, not every infrastructure metric.
Connect SLOs to deploy policy, observability, and chaos testing so numbers drive action.