SLO & Error Budget

SLI — Service Level Indicator

The actual measurement used to judge reliability. Must be a ratio: good events / total events.

Common SLIs:

SLI            Formula
Availability   uptime / total_time
Success rate   non-5xx requests / all requests
Latency        requests under threshold / all requests

Availability is computed in PromQL as: avg_over_time(up[28d]) * 100
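The ratio form of an SLI can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `sli_percent` and the request counts are hypothetical, and treating a window with no traffic as 100% is an assumption.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return an SLI as a percentage: good events / total events * 100."""
    if total_events <= 0:
        # Assumption: with no events in the window, nothing failed.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical success-rate example: 1,000,000 requests, 700 of them 5xx.
total_requests = 1_000_000
failed_requests = 700
success_rate = sli_percent(total_requests - failed_requests, total_requests)
print(f"success rate: {success_rate:.2f}%")
```

The same function covers the latency SLI if `good_events` counts requests that finished under the threshold.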

SLO — Service Level Objective

The reliability target you commit to. Example: "99.9% of requests return a successful response over a rolling 28-day window."

Set in sre.yaml per service:

services:
  - name: api
    slo_target: 99.5

SLOs should reflect user pain, not engineering perfection. An SLO of 100% leaves no room to deploy, experiment, or absorb infrastructure noise.

Composite SLO

Mean SLI attainment across all services — a fleet-wide health number useful for a status page or executive summary.
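As a rough sketch, the composite number could be computed as below. It assumes an unweighted arithmetic mean of per-service SLI percentages; the text does not say whether services are weighted by traffic. The service names and values are made up for illustration.

```python
from statistics import fmean


def composite_sli(sli_by_service: dict[str, float]) -> float:
    """Unweighted mean of per-service SLI percentages (assumed definition)."""
    if not sli_by_service:
        raise ValueError("at least one service is required")
    return fmean(sli_by_service.values())


# Hypothetical fleet of three services.
fleet = {"api": 99.95, "web": 99.80, "worker": 99.50}
composite = composite_sli(fleet)
print(f"composite SLI: {composite:.2f}%")
```

An unweighted mean gives a low-traffic service the same influence as a high-traffic one, which is worth keeping in mind when the number is shown on a status page.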

Error Budget

Allowed downtime or errors before breaching the SLO target.

If SLO = 99.9%, error budget = 0.1% of the window:

  • A 28-day window is 40,320 minutes, so 0.1% is about 40 minutes of allowed downtime.
  • Error budget is the fuel for engineering risk: deploy, experiment, migrate, upgrade.

budget = 100 - slo_target
burned = max(0, 100 - sli)
remaining = (1 - burned / budget) × 100%
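The three formulas above translate directly into Python. The sketch below is illustrative only: the function name `error_budget`, the returned dictionary keys, and the `window_days` parameter are assumptions, and all inputs are percentages as in the formulas.

```python
def error_budget(slo_target: float, sli: float, window_days: int = 28) -> dict[str, float]:
    """Apply the budget / burned / remaining formulas to percentage inputs."""
    budget = 100.0 - slo_target
    if budget <= 0:
        # A 100% SLO leaves no budget, so the ratio below is undefined.
        raise ValueError("slo_target must be below 100")
    burned = max(0.0, 100.0 - sli)
    remaining = (1.0 - burned / budget) * 100.0
    # Budget expressed as minutes of full downtime over the window.
    allowed_minutes = window_days * 24 * 60 * budget / 100.0
    return {
        "budget_pct": budget,
        "burned_pct": burned,
        "remaining_pct": remaining,
        "allowed_minutes": allowed_minutes,
    }


# SLO of 99.9% with a measured SLI of 99.95%: half of the budget is spent.
result = error_budget(slo_target=99.9, sli=99.95)
print(f"remaining: {result['remaining_pct']:.1f}%")
print(f"allowed downtime: {result['allowed_minutes']:.1f} min")
```

Note that `remaining_pct` goes negative once the SLI drops below the target, which is one way to represent an overspent budget.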

Burn Rate

How fast you're consuming error budget relative to the pace that would exhaust it at window end.

Burn rate   Meaning
1           On track to exactly use the budget by day 28
> 1         Will exhaust budget early; action needed
> 2         Recommended: pause non-critical deploys

Short-window burn rates (1h, 6h) catch fast-moving incidents earlier than the 28d window.

error_rate = (100 - sli_pct) / 100
budget_rate = budget_pct / 100, where budget_pct = 100 - slo_target
burn_rate = error_rate / budget_rate
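A small Python sketch of the burn-rate calculation, with the thresholds from the table mapped to a suggested action. The function names and the action strings are hypothetical; `sli_pct` is assumed to be the SLI measured over whichever window (1h, 6h, 28d) is being evaluated.

```python
def burn_rate(sli_pct: float, slo_target: float) -> float:
    """Error rate divided by budget rate, both as fractions."""
    budget_pct = 100.0 - slo_target
    if budget_pct <= 0:
        raise ValueError("slo_target must be below 100")
    error_rate = (100.0 - sli_pct) / 100.0
    budget_rate = budget_pct / 100.0
    return error_rate / budget_rate


def suggested_action(rate: float) -> str:
    """Map a burn rate to the guidance in the burn-rate table."""
    if rate > 2:
        return "pause non-critical deploys"
    if rate > 1:
        return "action needed: budget will be exhausted early"
    return "on track"


# Hypothetical 1h window: SLI of 99.7% against a 99.9% SLO.
rate_1h = burn_rate(sli_pct=99.7, slo_target=99.9)
print(f"burn rate: {rate_1h:.1f} -> {suggested_action(rate_1h)}")
```

Running the same function on SLIs from a short and a long window side by side is one way to separate a brief spike from sustained budget consumption.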