NearIRM Team
6 min read

SLO Burn Rate Alerts: Stop Missing Incidents and Cut the Noise

If your SLO alerting wakes you up for blips that heal themselves, or fails to catch a slow-burn degradation until your error budget is already toast, you're not alone. Most teams start with a simple error rate threshold, discover it's useless, and then aren't sure what to replace it with.

The answer is multi-window, multi-burn-rate alerting. It's the approach Google published in the SRE Workbook, and once you understand the shape of it, it's not that complicated.

What burn rate actually means

Your error budget is the percentage of requests (or time) you're allowed to fail over a window, derived from your SLO. If your SLO is 99.9% availability over 30 days, you have a budget of 0.1% of requests.

Burn rate measures how fast you're consuming that budget relative to a sustainable pace. A burn rate of 1 means you're burning budget exactly as fast as you can afford to, over the whole window. A burn rate of 2 means you're burning twice as fast. A burn rate of 14.4 means your entire 30-day budget will be gone in 50 hours.

The formula is simple:

burn_rate = error_rate / error_budget_fraction

So if your error budget is 0.1% (0.001) and you're seeing a 1% error rate (0.01):

burn_rate = 0.01 / 0.001 = 10

At burn rate 10, you'd exhaust your 30-day budget in 3 days.
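The arithmetic above fits in a few lines. A minimal sketch (the helper names are illustrative, not from any library):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Error rate relative to the budget the SLO allows."""
    budget_fraction = 1 - slo          # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget_fraction

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long the full budget lasts at a constant burn rate."""
    return window_days * 24 / rate

print(burn_rate(0.01, 0.999))      # ~10: 1% errors against a 0.1% budget
print(hours_to_exhaustion(10))     # ~72 hours, i.e. 3 days
print(hours_to_exhaustion(14.4))   # ~50 hours
```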

Why a single threshold doesn't work

The first instinct is to pick a burn rate threshold, say 2, and fire an alert whenever you exceed it for 5 minutes. That sounds reasonable. Here's why it fails in practice:

It misses slow burns. A burn rate of 1.5 won't trigger your alert, but it exhausts your entire 30-day budget in just 20 days. By the time the month is up, you've breached your SLO and you never got paged.

It generates noise for short spikes. A 10-minute spike at burn rate 20 burns under 0.5% of your budget. Not worth waking someone up at 3am. But a single-window alert with a threshold of 2 will fire every time.

It has a long reset time. Once the alert fires, it often takes the full measurement window to clear, even after the incident is resolved. Your team gets a flood of alerts during recovery.
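The first two failure modes are easy to quantify: the fraction of budget an incident consumes is burn_rate × duration / window. A rough sketch, with a hypothetical helper name:

```python
def budget_consumed(rate: float, duration_hours: float, window_days: int = 30) -> float:
    """Fraction of the error budget consumed by an incident at a given burn rate."""
    return rate * duration_hours / (window_days * 24)

# Slow burn: rate 1.5 is below a threshold of 2, yet overspends the whole budget
print(budget_consumed(1.5, 30 * 24))   # ~1.5, i.e. 150% of budget over the window

# Short spike: 10 minutes at rate 20 barely dents the budget
print(budget_consumed(20, 10 / 60))    # ~0.0046, under 0.5% of budget
```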

The multi-window approach

The fix is to use two time windows for each alert: a long window to measure the burn rate, and a short window to confirm it's still happening right now. Both must be true for the alert to fire. This reduces false positives dramatically.

You also want multiple burn rate tiers, because a critical 2am page and a Slack notification are not the same thing.

Here's a practical starting point for a 30-day SLO:

Severity    Long window    Short window    Burn rate    Budget consumed
Page now    1h             5m              14.4         2% in 1h
Page now    6h             30m             6            5% in 6h
Ticket      3d             6h              1            10% in 3 days

The top row catches fast-moving disasters. The second catches sustained degradations that would breach your SLO within a few days. The third is a slow-burn warning you can triage during business hours.
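As a sanity check, each row of the table satisfies the same invariant: at the tier's burn rate, filling the long window consumes exactly the stated slice of a 30-day budget. A quick sketch:

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window

def consumed_over(rate: float, hours: float) -> float:
    """Fraction of the 30-day budget spent at a given burn rate for a given duration."""
    return rate * hours / WINDOW_HOURS

print(consumed_over(14.4, 1))    # ~0.02 -> 2% in 1h
print(consumed_over(6, 6))       # ~0.05 -> 5% in 6h
print(consumed_over(1, 72))      # ~0.10 -> 10% in 3 days
```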

Prometheus example

If you're using Prometheus, here's what the high-severity alert looks like using recording rules:

# Recording rules (precompute for performance); aggregating `by (job)`
# preserves the label used in the alert annotation below
- record: job:slo_errors:rate1h
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[1h])) / sum by (job) (rate(http_requests_total[1h]))

- record: job:slo_errors:rate5m
  expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))

# Alert rule
- alert: SLOBurnRateHigh
  expr: >
    job:slo_errors:rate1h > (14.4 * 0.001)
    and
    job:slo_errors:rate5m > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High SLO burn rate on {{ $labels.job }}"
    description: "Error rate {{ $value | humanize }} exceeds the 14.4x burn rate threshold"

The 0.001 is your error budget fraction (1 - 0.999 for a 99.9% SLO). Adjust it for your target.

For the medium-severity tier, swap in 6h/30m windows and a 6 * 0.001 threshold.
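The threshold constants in those expressions are just burn_rate × budget_fraction. A small sketch that generates them (a hypothetical helper, not a Prometheus API):

```python
def promql_threshold(burn: float, slo: float) -> float:
    """Error-rate threshold to compare against in the alert expression."""
    return burn * (1 - slo)

for burn in (14.4, 6, 1):
    print(f"burn rate {burn}: error rate > {promql_threshold(burn, 0.999):.4f}")
# 14.4 -> 0.0144, 6 -> 0.0060, 1 -> 0.0010
```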

Common mistakes when setting this up

Using request count instead of request rate. Your windows need to be rates (errors per second, or error fraction) not raw counts. Raw counts make your thresholds dependent on traffic volume, so a quiet night looks fine even when the error rate is terrible.

Forgetting low-traffic services. If a service gets 10 requests per hour, a 5-minute window might see 0 or 1 errors, making any burn rate calculation noisy. For low-traffic services, widen your short window to 30 minutes or switch to a time-based SLO (availability minutes) instead.
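The traffic math behind that advice is simple: at 10 requests per hour, a 5-minute window averages less than one request, so a single error swings the measured rate wildly. A quick sketch:

```python
def expected_requests(req_per_hour: float, window_min: float) -> float:
    """Average number of requests a rate window will see."""
    return req_per_hour * window_min / 60

for window_min in (5, 30):
    n = expected_requests(10, window_min)
    print(f"{window_min}m window: ~{n:.1f} requests on average")
# A 5m window sees ~0.8 requests; widening to 30m gets you ~5,
# still noisy, but enough for a rate to mean something
```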

Setting the page threshold too low. Teams often start with burn rate 2 as the critical threshold, then wonder why they're getting paged constantly. The point of the high-severity tier is to catch genuine emergencies. Burn rate 14.4 (budget gone in 50 hours) is a better starting point. You can always tune it down after you've seen a few real incidents.

Not testing it. Write a small load test that generates errors at a known rate, run it in staging, and verify your alerts fire at the right times. You don't want to discover your PromQL is wrong during an actual incident.
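One way to sketch that verification logic is to fail a deterministic fraction of requests and confirm the measured rate matches what you injected. `send_request` here is a hypothetical stand-in; in a real test it would hit your staging endpoint:

```python
def send_request(i: int, inject_every: int) -> bool:
    """Stand-in for a real HTTP call; fails deterministically every Nth request."""
    return i % inject_every != 0   # True means success

def measured_error_rate(total: int, inject_every: int) -> float:
    """Error fraction observed over a batch of requests."""
    errors = sum(1 for i in range(total) if not send_request(i, inject_every))
    return errors / total

# Inject a known 1% error rate, then confirm the measurement matches
assert abs(measured_error_rate(1000, 100) - 0.01) < 1e-9
```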

Calibrating for your SLO window

The numbers above assume a 30-day rolling window. If your SLO uses a different window, the burn rates stay the same but the budget-consumed-per-hour math changes. For a 7-day SLO, burn rate 14.4 empties the entire budget in under 12 hours, so the same threshold is even more urgent.

The general principle: pick your high-severity burn rate so that an unmitigated incident at that rate would breach your SLO within a few hours. Pick your medium-severity rate so it represents something you need to fix within a day.
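That principle is one line of arithmetic: the burn rate that breaches in H hours is window_hours / H. A sketch, with an illustrative helper name:

```python
def burn_rate_for_breach(within_hours: float, window_days: int) -> float:
    """Burn rate at which the full budget is gone in `within_hours`."""
    return window_days * 24 / within_hours

# 30-day window, breach in 5 days -> the medium tier's burn rate of 6
print(burn_rate_for_breach(120, 30))   # ~6

# 7-day window: burn rate 14.4 empties the budget in under 12 hours
print(7 * 24 / 14.4)                   # ~11.7 hours
```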

Getting the alert routing right

The multi-window approach reduces alert noise, but it only works if you pair it with sensible routing. Your high-severity (burn rate 14.4) alerts should go to an on-call rotation with a tight SLA. Your low-severity (burn rate 1-2) alerts can go to a team Slack channel or create a ticket for the next working day.

If your alerting tool lets you annotate alerts with burn rate and remaining budget at time of firing, do that. When an engineer gets paged at 2am, seeing "burn rate 18, 3% budget remaining" gives them immediate context on severity and urgency before they've even looked at a dashboard.
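A sketch of how such an annotation string might be assembled (a hypothetical helper, to be adapted to whatever templating your alerting tool supports):

```python
def page_context(rate: float, consumed_fraction: float, window_days: int = 30) -> str:
    """One-line severity summary for an alert annotation."""
    remaining = 1 - consumed_fraction
    hours_left = remaining * window_days * 24 / rate
    return (f"burn rate {rate:g}, {remaining:.0%} budget remaining, "
            f"~{hours_left:.0f}h to exhaustion at this rate")

print(page_context(18, 0.97))
# e.g. "burn rate 18, 3% budget remaining, ~1h to exhaustion at this rate"
```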

The goal of SLO alerting isn't to page on every anomaly. It's to page when the customer experience is degrading in a way that matters, and to route everything else somewhere it'll actually get looked at.
