NearIRM Team

Error Budgets: A Practical Guide for On-Call Teams

Most teams set an SLO, watch it on a dashboard, and then... mostly ignore it until something breaks badly enough to make it a meeting topic. The number sits there. Engineers get paged. The SLO ticks down. Nobody connects the two.

Error budgets fix that. They turn a passive percentage into something you can spend, track, and make decisions with.

What an error budget actually is

An SLO of 99.9% availability means you're promising that your service will be up 99.9% of the time over a rolling window. The other 0.1% is your error budget: the amount of "bad" time you're allowed before you've broken your promise.

Over 30 days, 0.1% works out to about 43 minutes of allowable downtime. That's it. Every incident, every deployment gone wrong, every flapping dependency chips away at those 43 minutes.

The budget isn't a punishment. It's a negotiation between reliability and velocity. You can spend it on risky deployments, experiments, and tech debt, or you can hoard it as a cushion. The choice is yours, but you have to make the choice consciously.

Calculating your error budget

The formula is simple, with the SLO target written as a fraction (99.9% becomes 0.999):

error_budget = (1 - SLO_target) × window_in_minutes

Some common SLO targets:

SLO Target    Monthly budget (30 days)    Weekly budget (7 days)
99.9%         43.2 minutes                10.1 minutes
99.5%         216 minutes (3.6 hrs)       50.4 minutes
99.0%         432 minutes (7.2 hrs)       100.8 minutes
99.95%        21.6 minutes                5.0 minutes
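
If you'd rather compute these numbers than look them up, here is a minimal Python sketch of the formula above. The function name and the choice to express the window in days are illustrative assumptions, not part of any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Allowed 'bad' minutes for an SLO target given as a fraction (0.999 = 99.9%)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_in_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_in_minutes


# Print monthly and weekly budgets for the targets listed in the table.
for target in (0.999, 0.995, 0.99, 0.9995):
    monthly = error_budget_minutes(target, window_days=30)
    weekly = error_budget_minutes(target, window_days=7)
    print(f"{target:.2%}  monthly={monthly:.1f} min  weekly={weekly:.1f} min")
```

Passing the target as a fraction rather than a percentage keeps the arithmetic identical to the formula and avoids an easy off-by-100 mistake.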

Which window you pick matters. A 30-day rolling window smooths out rough weeks. A 7-day window is more sensitive to recent events. Most teams start with 30 days and adjust from there.
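
To see how much the window choice matters, here is a rough sketch that measures remaining budget against the same incident history using two different rolling windows. The incident format (a list of start and end timestamps) and the function name are assumptions for illustration, and the sketch assumes incidents do not overlap.

```python
from datetime import datetime, timedelta


def budget_remaining_minutes(incidents, now, slo_target=0.999, window_days=30):
    """Remaining budget in minutes for a rolling window ending at `now`.

    `incidents` is a hypothetical list of (start, end) datetime pairs marking
    'bad' time. Overlapping incidents would be double counted.
    """
    window_start = now - timedelta(days=window_days)
    budget = (1.0 - slo_target) * window_days * 24 * 60
    bad_minutes = 0.0
    for start, end in incidents:
        # Count only the part of each incident that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            bad_minutes += (clipped_end - clipped_start).total_seconds() / 60
    return budget - bad_minutes


now = datetime(2024, 3, 31)
incidents = [
    (datetime(2024, 3, 10, 8, 0), datetime(2024, 3, 10, 8, 15)),    # 15 minutes, three weeks ago
    (datetime(2024, 3, 28, 12, 0), datetime(2024, 3, 28, 12, 20)),  # 20 minutes, three days ago
]
print(budget_remaining_minutes(incidents, now, window_days=30))
print(budget_remaining_minutes(incidents, now, window_days=7))
```

With this made-up history, the 30-day window still has a little budget left, while the 7-day window is already overspent because the recent 20-minute incident is larger than the whole weekly budget.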

Your SLI (Service Level Indicator) determines what counts as "bad." Common choices:

  • Availability: percentage of requests that returned a non-5xx response
  • Latency: percentage of requests that completed within a threshold (e.g., under 500ms)
  • Error rate: percentage of requests that returned an error (the inverse of the success rate)

Pick the SLI that best reflects whether your users are having a good experience. Don't pick five of them and average them together.
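
As a rough illustration, the availability and latency SLIs above can each be computed as "good requests divided by total requests." The function names and the input shapes (plain lists of status codes and latencies) are assumptions; in practice these counts usually come from your metrics system.

```python
def availability_sli(status_codes):
    """Fraction of requests that returned a non-5xx response."""
    if not status_codes:
        raise ValueError("need at least one request")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def latency_sli(latencies_ms, threshold_ms=500):
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


print(availability_sli([200, 200, 503, 404]))  # a 404 is not a 5xx, so it counts as good here
print(latency_sli([120, 480, 650, 90]))
```

Whether a 4xx response should count as "good" depends on your service. This sketch follows the non-5xx definition from the list above.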

Connecting error budgets to on-call decisions

This is where most guides stop at theory. Here's the practical part.

When your budget is healthy (>50% remaining)

You have room to move fast. This is the right time to:

  • Ship larger, riskier changes
  • Run experiments in production
  • Work down tech debt that requires service restarts
  • Do load testing or capacity work

Your on-call team shouldn't be blocking these. The budget exists precisely so you can spend it on things that matter.

When your budget is tight (10-50% remaining)

Slow down and get deliberate. This doesn't have to mean a full change freeze, but it does mean:

  • Smaller deploys, more incremental rollouts
  • Extra care around deployments that touch the critical path
  • A quick look at what burned the budget recently

If the budget burned down faster than expected and nobody knows why, that's a signal. Run a lightweight incident review even if no individual incident crossed your severity threshold. Sometimes budget disappears in a dozen small paper cuts.

When your budget is exhausted (0% remaining)

Full stop on risky changes. Your priority now is restoring reliability so the budget can recover as older incidents age out of the rolling window.

Practically, this means:

  • Freeze non-critical deployments
  • Pull your on-call team into a focused review of the past few weeks
  • Identify the top one or two contributing causes and address them before the budget recovers and the freeze lifts
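
The three tiers above can be sketched as a small lookup that a dashboard or bot could call. The thresholds come from the section headings. The headings leave the band between 0% and 10% undefined, so this sketch treats it as "tight", which is an assumption you may want to change. The function name is also illustrative.

```python
def budget_status(remaining_fraction: float) -> str:
    """Map remaining budget (1.0 = untouched, 0.0 = fully spent) to a tier."""
    if remaining_fraction <= 0.0:
        return "exhausted"  # freeze non-critical deployments
    if remaining_fraction > 0.5:
        return "healthy"    # room for riskier changes and experiments
    # Covers 10-50% as described above, plus the 0-10% band the tiers
    # leave undefined (an assumption made for this sketch).
    return "tight"          # smaller deploys, incremental rollouts


for remaining in (0.8, 0.3, 0.05, 0.0):
    print(f"{remaining:.0%} remaining -> {budget_status(remaining)}")
```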

Tying error budgets to incident severity

Error budgets are most useful when you wire them into how you classify and respond to incidents.

A rough model that works for many teams:

  • SEV-1 (site down): drains budget fast, immediate response regardless of budget status
  • SEV-2 (significant degradation): track the budget burn; if you're already tight, escalate faster than you normally would
  • SEV-3 (minor impact): log it, but a healthy budget means you can work it during business hours

When your budget is tight, some teams temporarily lower the severity threshold: incidents that would normally be SEV-3 get treated as SEV-2 until the budget recovers. This keeps the team focused on reliability when the margin is slim.
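
Here is one way the temporary escalation rule could look in code. It is a sketch under assumptions: severities are plain integers, "tight" means 50% or less of the budget remaining (which includes an exhausted budget), and the function name is hypothetical.

```python
def effective_severity(declared: int, remaining_fraction: float,
                       tight_threshold: float = 0.5) -> int:
    """Return the severity to respond at, given the remaining error budget.

    SEV-3 incidents are bumped to SEV-2 while the budget is at or below
    tight_threshold. SEV-1 and SEV-2 are returned unchanged.
    """
    if declared not in (1, 2, 3):
        raise ValueError("declared severity must be 1, 2, or 3")
    if declared == 3 and remaining_fraction <= tight_threshold:
        return 2
    return declared


print(effective_severity(3, remaining_fraction=0.8))  # healthy budget
print(effective_severity(3, remaining_fraction=0.3))  # tight budget
```

Keeping the rule in one function makes the escalation easy to switch off again once the budget is healthy.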

What to do when your SLO is wrong

Sometimes your error budget runs out in the first week of every month and your users aren't complaining. That's a sign your SLO is too tight for where your system actually is.

Other times, your budget is always full but complaints keep piling up in your support queue. That's a sign your SLI isn't measuring the right thing.

Neither of these is a failure. SLOs are supposed to be living documents. Revisit them quarterly, check them against user feedback and support tickets, and adjust.

The worst thing you can do is set a target, watch it get breached, and do nothing because "that's just how the system is." An SLO nobody believes in is noise. An error budget nobody uses is just a chart.

A minimal implementation to get started

You don't need a sophisticated observability platform to get value from error budgets. Here's a starting point:

  1. Pick one SLI. Availability (successful request rate) is usually the easiest to start with.
  2. Set a 30-day rolling window and pick a target based on where your system actually is, not where you wish it was.
  3. Calculate your budget and put it somewhere visible. A Slack bot that posts the remaining budget each morning works fine.
  4. Add a rule to your on-call runbook: if the error budget drops below 20%, flag it in your weekly team sync.
  5. Review monthly. After three months, you'll have enough data to have a real conversation about whether the target is right.
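
The steps above can be sketched in a few lines of Python. This version assumes a request-based availability SLI, so the budget is a count of allowed failed requests rather than minutes. The function names, the service name, and the message format are all hypothetical, and the sketch only builds the text a bot would post, leaving the actual Slack integration out.

```python
def remaining_budget_fraction(total_requests: int, failed_requests: int,
                              slo_target: float) -> float:
    """Share of the error budget still unspent over the window (can go negative)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("need a target below 1.0 and at least one request")
    return 1.0 - failed_requests / allowed_failures


def daily_budget_message(service: str, total_requests: int, failed_requests: int,
                         slo_target: float = 0.999, flag_below: float = 0.20) -> str:
    """Build the morning status line, adding a flag when the budget runs low."""
    remaining = remaining_budget_fraction(total_requests, failed_requests, slo_target)
    line = (f"{service}: {remaining:.0%} of 30-day error budget remaining "
            f"(SLO {slo_target:.2%})")
    if remaining < flag_below:
        line += f" - below {flag_below:.0%}, flag at weekly sync"
    return line


# Counts would come from your request logs or metrics for the last 30 days.
print(daily_budget_message("checkout-api", total_requests=1_000_000, failed_requests=250))
print(daily_budget_message("checkout-api", total_requests=1_000_000, failed_requests=900))
```

The 20% flag threshold matches the runbook rule in step 4. Everything else is a starting point to adapt.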

That's it. No fancy tooling required to start. Add complexity only once the basics are running and your team is actually looking at the numbers.

The team dynamic piece

Error budgets only work if the on-call team and the product/engineering team share ownership of them. If on-call is a separate function that "handles incidents" while product ships features, the budget becomes a blame mechanism rather than a coordination tool.

The conversation you want to have is: "We have 12 minutes of budget left this month. Which of the three planned deploys do we hold?" That's a team decision, not a unilateral ops veto. When both sides own the number, it becomes a useful constraint rather than a political football.

That's the real value of error budgets: not the math, but the shared language they create between people who care about moving fast and people who care about staying stable. Usually, those are the same people.
