SLO & Error Budget

SLI — Service Level Indicator

The actual measurement used to judge reliability. Must be a ratio: good events / total events.

Common SLIs:

SLI	Formula
Availability	`uptime / total_time`
Success rate	`non-5xx requests / all requests`
Latency	`requests under threshold / all requests`

Computed as a percentage over the 28-day window: avg_over_time(up[28d]) × 100
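As a minimal sketch of the ratio form, the snippet below computes a success-rate SLI as a percentage from two counters. The function name `sli_percent` and the choice to report 100% when there is no traffic are illustrative assumptions, not something these notes specify.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: with no traffic nothing failed, so report 100%.
        return 100.0
    return good_events / total_events * 100


# Example: 999,500 non-5xx responses out of 1,000,000 requests.
print(sli_percent(999_500, 1_000_000))
```

Expressing the result in percent keeps it directly comparable with a percentage target such as slo_target.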

An SLO (Service Level Objective) is the reliability target you commit to. Example: "99.9% of requests return a successful response over a rolling 28-day window."

Set in sre.yaml per service:

services:
  - name: api
    slo_target: 99.5

SLOs should reflect user pain, not engineering perfection. An SLO of 100% leaves no room to deploy, experiment, or absorb infrastructure noise.

A fleet-wide health number can be taken as the mean SLI attainment across all services; it is useful for a status page or executive summary.

The error budget is the amount of downtime or errors allowed before the SLO target is breached.

If SLO = 99.9%, error budget = 0.1% of the window:

28 days × 0.1% ≈ 40 minutes of allowed downtime
Error budget is the fuel for engineering risk: deploy, experiment, migrate, upgrade.
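The 40-minute figure can be checked with a few lines of arithmetic. This is a small sketch; `allowed_downtime_minutes` is a hypothetical helper name, and it assumes the whole budget is spent as full downtime rather than partial errors.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of full downtime permitted by an SLO (in percent) over the window."""
    if not 0 <= slo_target <= 100:
        raise ValueError("slo_target must be a percentage between 0 and 100")
    window_minutes = window_days * 24 * 60  # 28 days -> 40,320 minutes
    return (100 - slo_target) / 100 * window_minutes


print(round(allowed_downtime_minutes(99.9), 2))  # 40.32
print(round(allowed_downtime_minutes(99.5), 2))  # 201.6
```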

budget    = 100 - slo_target
burned    = max(0, 100 - sli)
remaining = (1 - burned / budget) × 100%
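The three formulas above translate directly into code. The sketch below assumes both slo_target and sli are percentages; the function name `error_budget` and the returned dictionary keys are illustrative choices.

```python
def error_budget(slo_target: float, sli: float) -> dict:
    """Apply the budget / burned / remaining formulas (all inputs in percent)."""
    budget = 100 - slo_target
    if budget <= 0:
        # An SLO of 100% leaves no budget, so the ratio below is undefined.
        raise ValueError("slo_target must be below 100")
    burned = max(0.0, 100 - sli)
    remaining_pct = (1 - burned / budget) * 100
    return {"budget": budget, "burned": burned, "remaining_pct": remaining_pct}


# SLO 99.5% with a measured SLI of 99.8%: roughly 60% of the budget is left.
print(error_budget(99.5, 99.8))
# SLI below the target: remaining_pct goes negative, meaning the budget is overspent.
print(error_budget(99.5, 99.2))
```

A negative remaining_pct is left unclamped here so that the size of an overspend stays visible; clamping to zero would be an equally reasonable choice.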

Burn rate is how fast you're consuming error budget relative to the pace that would exhaust it exactly at the end of the window.

Burn rate	Meaning
1	On track to exactly use the budget by day 28
> 1	Will exhaust budget early — action needed
> 2	Recommended: pause non-critical deploys

Short-window burn rates (1h, 6h) catch fast-moving incidents earlier than the 28d window.

budget_pct  = 100 - slo_target
error_rate  = (100 - sli_pct) / 100
budget_rate = budget_pct / 100
burn_rate   = error_rate / budget_rate
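A short sketch of the burn-rate formulas follows, assuming sli_pct is measured over whichever window is being evaluated (1h, 6h, or 28d) and that the budget is 100 - slo_target. The helper `days_to_exhaustion` is an added illustration, not part of the formulas above; it assumes a full budget at the start of the window and a constant burn rate.

```python
def burn_rate(sli_pct: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the budget allows."""
    budget_pct = 100 - slo_target
    if budget_pct <= 0:
        raise ValueError("slo_target must be below 100")
    error_rate = (100 - sli_pct) / 100
    budget_rate = budget_pct / 100
    return error_rate / budget_rate


def days_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Assumption: full budget at window start and a constant burn rate."""
    if rate <= 0:
        return float("inf")  # No errors observed, so the budget is never used up.
    return window_days / rate


rate = burn_rate(sli_pct=99.8, slo_target=99.9)
print(round(rate, 2))                      # 2.0
print(round(days_to_exhaustion(rate), 1))  # 14.0
```

At a burn rate of about 2 the budget lasts roughly half the window, which matches the threshold in the table for pausing non-critical deploys.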