
SLOs, Error Budgets, and Production Reliability: A Practical Guide

An SLO is a time-bounded reliability target for a user-facing journey; its error budget is the number of bad events allowed in that window. Healthy teams alert on budget burn rather than on every metric blip, and slow launches when the budget is spent.

Service Level Objectives (SLOs) are internal reliability targets expressed as a percentage over a window: for example, 99.9% of API requests complete successfully in under 300 ms each calendar month. A Service Level Indicator (SLI) is the measured ratio of good events to valid events. The error budget is the allowable unreliability (e.g. 0.1% for a 99.9% SLO), which you can spend on launches, refactors, or aggressive rollouts. This guide shows how to choose SLIs, set realistic targets, tie alerts to SLO burn rather than noisy thresholds, and align product and engineering on trade-offs. The framing follows Google's Site Reliability Engineering book (chapter "Service Level Objectives"), which popularized error budgets as a way to balance velocity and stability.

Key takeaways

Measure user-perceived reliability (successful, fast-enough requests)—not just server uptime. A service can be "up" but unusably slow.

Pick a small number of SLIs (often 1–3 per user journey): availability, latency, and sometimes freshness or correctness.

Connect on-call alerts to burn rate (how fast you consume the error budget), not to every blip in raw metric charts.

When the budget is exhausted or nearly so, slow feature launches and prioritize reliability work until the rolling window recovers.

From SLI to SLO: concrete examples

Availability SLI: the proportion of HTTP GET /v1/orders/{id} calls that return 2xx or a valid 404 (not 5xx), out of all calls excluding client aborts. Latency SLI: the proportion of those calls that complete within 200 ms, with duration measured at the load balancer (the service edge).
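
The two SLIs above can be sketched in Python as ratios of good events to valid events. This is a minimal illustration, not a reference implementation: the Request record, its field names, and the sample data are hypothetical, every 404 is treated as "valid", any other non-2xx status is counted as bad, and "those calls" is read as the valid calls that also succeeded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    status: int                   # HTTP status code returned to the caller
    duration_ms: float            # duration as recorded at the load balancer
    client_aborted: bool = False  # aborted calls are excluded from valid events


LATENCY_THRESHOLD_MS = 200  # threshold taken from the latency SLI above


def is_good(r: Request) -> bool:
    # Assumption: every 404 is "valid"; anything else outside 2xx is bad.
    return 200 <= r.status < 300 or r.status == 404


def availability_sli(requests: list[Request]) -> Optional[float]:
    valid = [r for r in requests if not r.client_aborted]
    if not valid:
        return None  # no valid events, so the SLI is undefined
    return sum(is_good(r) for r in valid) / len(valid)


def latency_sli(requests: list[Request]) -> Optional[float]:
    # Assumption: the denominator is the valid calls that also succeeded.
    good = [r for r in requests if not r.client_aborted and is_good(r)]
    if not good:
        return None
    fast = sum(r.duration_ms <= LATENCY_THRESHOLD_MS for r in good)
    return fast / len(good)


sample = [
    Request(200, 120),
    Request(200, 250),
    Request(404, 80),
    Request(500, 30),
    Request(200, 90, client_aborted=True),
]
print(availability_sli(sample))  # 3 good out of 4 valid calls
print(latency_sli(sample))       # 2 fast out of 3 good calls
```

In production these counts would normally come from load balancer logs or metrics rather than in-memory objects; the point is only that both SLIs reduce to counting.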

Example: with a 99.9% availability SLO over 30 days and ~43.8 million valid requests in that window, up to ~43,800 requests (0.1%) can fail before you miss the objective. That allowance is your error budget, available to spend on planned risk.

Stricter tiers exist: 99.95% allows half as many bad events as 99.9% (0.05% vs 0.1%), and 99.99% allows only one tenth as many as 99.9%. Each additional "nine" materially increases engineering and redundancy cost.
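
The arithmetic behind these tiers is easy to check. The sketch below reuses the ~43.8 million request figure from the example and uses exact fractions to avoid floating-point rounding; the function name error_budget is illustrative.

```python
from fractions import Fraction


def error_budget(valid_events: int, slo: str) -> Fraction:
    """Bad events allowed by an SLO given as a decimal string, e.g. "0.999"."""
    return valid_events * (1 - Fraction(slo))


monthly_requests = 43_800_000  # request volume from the example above

for target in ("0.999", "0.9995", "0.9999"):
    budget = error_budget(monthly_requests, target)
    print(f"SLO {target}: {int(budget):,} bad requests allowed")
# SLO 0.999: 43,800 bad requests allowed
# SLO 0.9995: 21,900 bad requests allowed
# SLO 0.9999: 4,380 bad requests allowed
```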

Latency: percentiles vs SLI

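Raw p99 dashboards help with debugging but make poor SLOs on their own: a single long incident can dominate the percentile, and a percentile does not translate directly into a count of bad events to charge against an error budget. SRE-style SLIs often encode latency as a proportion under a threshold ("99% of requests faster than 300 ms") and pair it with an availability SLI.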

Define whether you measure server-side, client-side, or end-to-end latency; each tells a different story. For mobile clients, account for network variance, which server-side measurements do not capture.

Using error budgets in product decisions

If the budget is healthy, teams can absorb more release risk: for example, feature flags default to on, and database migrations proceed during business hours with monitoring.

If the budget is depleted, freeze risky changes, add capacity, fix flaky dependencies, and postpone large refactors until burn slows. This is how reliability becomes an explicit product trade-off instead of an afterthought.
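
One way to make this policy mechanical is a small gate that maps remaining budget to a release posture. This is a sketch under stated assumptions: the 25% "caution" cutoff and the posture names are hypothetical and should come from your own error budget policy.

```python
def budget_remaining(valid_events: int, bad_events: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = valid_events * (1 - slo)
    if allowed == 0:
        return 0.0  # a 100% SLO or zero traffic leaves no budget to spend
    return 1 - bad_events / allowed


def release_policy(remaining: float) -> str:
    # Hypothetical cutoffs; choose values that match your own policy.
    if remaining <= 0:
        return "freeze"   # budget spent: reliability work only
    if remaining < 0.25:
        return "caution"  # low budget: defer risky launches
    return "normal"       # healthy budget: accept more release risk


remaining = budget_remaining(43_800_000, 40_000, 0.999)
print(round(remaining, 3), release_policy(remaining))
```

Keeping the cutoffs in code or config, rather than deciding them during an incident, serves the same purpose as documenting SLO targets in advance.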

Document decisions: "We accept 99.5% API availability for this internal admin tool vs 99.9% for the customer API" so stakeholders do not debate targets mid-incident.

Alerting: multi-window burn rates

Google's multiwindow, multi-burn-rate alerting (described in The Site Reliability Workbook, chapter "Alerting on SLOs") pairs short and long windows: for example, page when 2% of the budget burns in 1 hour (a burn rate of 14.4 against a 30-day window), but open a ticket rather than page for a slow burn over days. This reduces pager fatigue while still catching real regressions.
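
Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO window. The sketch below assumes a 99.9% SLO over 30 days. The 14.4 threshold corresponds to 2% of the budget in 1 hour; the ticket threshold of 1 over 3 days and the shorter confirmation windows (5 minutes and 6 hours) are assumptions modeled on the workbook's example values, and the function names are illustrative.

```python
SLO = 0.999
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO window

PAGE_BURN_RATE = 14.4   # 2% of the 30-day budget consumed in 1 hour
TICKET_BURN_RATE = 1.0  # assumed: 10% of the budget consumed in 3 days


def burn_rate(bad: int, valid: int, slo: float = SLO) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    if valid == 0:
        return 0.0
    return (bad / valid) / (1 - slo)


def budget_consumed(rate: float, window_hours: float) -> float:
    """Fraction of the whole budget spent at this burn rate over the window."""
    return rate * window_hours / SLO_PERIOD_HOURS


def alert_action(rate_1h: float, rate_5m: float,
                 rate_3d: float, rate_6h: float) -> str:
    # Each long window is paired with a short one so the alert stops
    # firing soon after the problem is actually fixed.
    if rate_1h >= PAGE_BURN_RATE and rate_5m >= PAGE_BURN_RATE:
        return "page"
    if rate_3d >= TICKET_BURN_RATE and rate_6h >= TICKET_BURN_RATE:
        return "ticket"
    return "none"


print(round(burn_rate(200, 10_000), 1))           # 2% errors against 0.1% allowed
print(round(budget_consumed(14.4, 1), 4))         # share of budget spent in 1 hour
print(alert_action(rate_1h=20, rate_5m=18, rate_3d=2, rate_6h=3))
```

In practice these rates would be computed by the monitoring system from windowed counters; the logic of the comparison is the same.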

Every alert should link to a runbook: dashboards, likely causes, rollback commands, and escalation—not merely "high error rate".

Limitations

SLOs describe steady-state user experience; they do not replace security monitoring, fraud detection, or data-quality checks.

Choosing thresholds without baseline measurements invites gaming, such as narrowing the definition of valid events until the target is met. Start from measured distributions, then tighten.

Low-traffic services have noisy SLIs because a handful of failures can swing the ratio; use longer windows or merge correlated journeys.
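
A quick way to see the low-traffic problem is to ask how many valid events a window needs before a single failure fits inside the budget. The sketch below is illustrative only: the function names and the traffic figure of 40 requests per day are hypothetical.

```python
import math
from fractions import Fraction


def min_events_for_one_failure(slo: str) -> int:
    """Smallest number of valid events at which one failure still meets the SLO."""
    return math.ceil(1 / (1 - Fraction(slo)))


def days_needed(events_per_day: int, slo: str) -> int:
    """Window length in days needed to reach that event count."""
    return math.ceil(Fraction(min_events_for_one_failure(slo), events_per_day))


print(min_events_for_one_failure("0.999"))  # 1000
print(min_events_for_one_failure("0.995"))  # 200
print(days_needed(40, "0.999"))             # 25
```

With 40 requests per day and a 99.9% target, any window shorter than 25 days is breached by a single failed request, which is why longer windows or merged journeys help.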

