NearIRM Team · 6 min read

How to Define SLOs That Your Team Will Actually Use

Most teams that write SLOs write them once, put them in a Confluence page, and forget about them. The SLOs don't influence on-call decisions, don't affect how alerts are configured, and don't change what gets prioritized. They're just documentation theater.

This guide is about writing SLOs that actually shape how your team works.

Start with what users notice, not what's easy to measure

The most common mistake is picking metrics based on what your monitoring stack already tracks. Latency at the 50th percentile is easy to export from Prometheus, so it ends up in the SLO. But your users don't care about median latency. They care about whether the checkout flow worked, whether their file uploaded, whether the report finished generating.

Start by asking: what does a bad experience look like for a user of this service? Not a degraded experience. A bad one. Work backward from that.

For most services, the meaningful signals come down to three things:

| Signal | What it captures | Good for |
| --- | --- | --- |
| Availability | Is the service responding at all? | APIs, user-facing endpoints |
| Latency | Is it responding fast enough? | Interactive features, search, checkout |
| Error rate | Is it returning correct results? | Data pipelines, background jobs, mutations |

Pick one or two that match what your users actually feel. Three SLIs per service is plenty. More than that and you'll spend your incident response time arguing about which metric is "the real one."

Choosing the right target number

"What should our availability SLO be?" is the wrong first question. The right question is: "How much downtime per month is acceptable before users start leaving or complaining to account managers?"

Here's a quick reference for what common targets actually mean in practice:

| SLO | Monthly downtime | Weekly downtime |
| --- | --- | --- |
| 99% | ~7.3 hours | ~1.7 hours |
| 99.5% | ~3.6 hours | ~50 minutes |
| 99.9% | ~43 minutes | ~10 minutes |
| 99.95% | ~21 minutes | ~5 minutes |
| 99.99% | ~4 minutes | ~1 minute |
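These figures are simple arithmetic on the unavailable fraction. A minimal sketch (plain Python, function name is illustrative) that reproduces the table:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of downtime an SLO permits over a given window."""
    unavailable_fraction = 1 - slo_percent / 100
    return unavailable_fraction * window_days * 24 * 60

# Monthly (~30.44 days on average) and weekly budgets for common targets
for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(slo, 30.44)
    weekly = allowed_downtime_minutes(slo, 7)
    print(f"{slo}%: ~{monthly:.0f} min/month, ~{weekly:.0f} min/week")
```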

A few things to keep in mind when picking your number:

Match your dependencies. If your service calls a third-party API with a 99.9% SLA, you can't commit to 99.99% without building retry logic, fallbacks, or caching. Your SLO can't be more reliable than the things you depend on.

Be honest about your current baseline. Pull your actual availability for the last 90 days. If you're running at 99.7%, committing to 99.99% immediately is a good way to burn out your on-call rotation. Set a target that's a small improvement over where you are today, then raise it once you've made the changes to support it.

Tighter isn't always better. A 99.99% SLO on an internal admin tool means waking someone up at 3am for four minutes of downtime. That's often not the right call. Match the SLO to the business impact of the service going down.

Writing the SLO formally

Vague SLOs cause arguments. "The service should be fast" doesn't tell anyone what to do when response times start climbing. Be specific.

A well-formed SLO has four parts:

  1. What you're measuring (the SLI)
  2. How you're measuring it (the data source and calculation)
  3. The target (the number)
  4. The window (rolling 30 days, calendar month, etc.)

Here's an example for a payment API:

SLO: Payment API Availability
SLI: The proportion of HTTP requests to /v1/payments that return a non-5xx response
Measurement: Calculated from nginx access logs, excluding health check endpoints
Target: 99.9% over a rolling 28-day window
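The availability SLI above is just a ratio of good requests to total requests. A minimal sketch of the calculation, assuming you've already counted requests and 5xx responses from your access logs (function and variable names are illustrative):

```python
def availability_sli(total_requests: int, error_5xx: int) -> float:
    """Proportion of requests that returned a non-5xx response."""
    if total_requests == 0:
        return 1.0  # no traffic in the window: treat as fully available
    return (total_requests - error_5xx) / total_requests

# e.g. 1,000,000 requests in the window, 850 of them 5xx
sli = availability_sli(1_000_000, 850)
print(f"{sli:.4%}")  # 99.9150% -> meets the 99.9% target
```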

And one for latency:

SLO: Payment API Latency
SLI: The proportion of successful /v1/payments requests that complete in under 500ms
Measurement: Calculated from application traces (p99 is tracked separately but not part of the SLO)
Target: 95% of requests under 500ms over a rolling 28-day window

Notice that the latency SLO doesn't say "p99 latency must be under 500ms." It says 95% of requests must complete under 500ms. This is the SLI-based approach, and it's more useful because it connects directly to user experience: 95% of your users get a fast response. The remaining 5% might be on slow connections, running complex queries, or hitting edge cases you can optimize later.
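The proportion-under-threshold calculation can be sketched in a few lines (names are illustrative; in practice the durations would come from your tracing backend):

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 500) -> float:
    """Proportion of successful requests completing under the threshold."""
    if not durations_ms:
        return 1.0  # no traffic: vacuously meets the target
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

samples = [120, 340, 480, 520, 90, 610, 450, 200, 770, 310]
print(latency_sli(samples))  # 0.7 -> well below a 95% target
```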

Connecting SLOs to your error budget

An SLO is only useful if it drives decisions. The mechanism for that is the error budget.

If your SLO is 99.9% availability over 28 days, your error budget is 0.1% of requests, or about 40 minutes of downtime. When you burn through that budget, you stop taking reliability risks until it resets. That means no major deploys, no infrastructure changes, no new features that could introduce instability.
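Budget remaining is the observed bad-request fraction measured against the allowed fraction. A minimal sketch, with illustrative names:

```python
def error_budget_remaining(slo: float, total: int, bad: int) -> float:
    """Fraction of the error budget left, given request counts in the window.

    slo is a fraction (0.999 for 99.9%); a negative result means the
    budget is already exhausted.
    """
    budget = 1 - slo          # allowed bad fraction, e.g. 0.001
    burned = bad / total      # observed bad fraction
    return 1 - burned / budget

# 2M requests this window, 1,200 of them failed, against a 99.9% SLO
print(f"{error_budget_remaining(0.999, 2_000_000, 1_200):.0%} budget left")
```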

When you have budget to spare, you can move faster. When you're running low, you slow down.

The error budget is also what makes SLOs worth surfacing in your incident response tooling. During an incident, knowing you've burned 60% of your monthly budget in a single hour changes how aggressively you respond. It changes whether you pull in more people, whether you roll back, whether you escalate to leadership.

Making SLOs part of how you work

An SLO that lives only in a document isn't an SLO. It's a target. The difference is whether the number influences real behavior.

A few things that help:

Wire SLOs to your alerting. Your on-call alerts should reflect SLO burn rate, not just raw thresholds. An alert that fires when you're burning error budget at 2x the sustainable rate (multi-window, multi-burn-rate alerting) is far more actionable than an alert that fires when latency crosses an arbitrary threshold.
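The multi-window, multi-burn-rate idea can be sketched as follows: page only when both a short and a long window are burning budget faster than the sustainable rate, so brief blips and already-recovered incidents don't wake anyone. This is a simplified illustration (names and thresholds are assumptions, not a production rule):

```python
def burn_rate(bad_fraction: float, slo: float) -> float:
    """How fast the error budget is burning relative to the sustainable rate.

    1.0 means burning exactly at the rate that exhausts the budget by
    the end of the window; 2.0 means twice that fast.
    """
    return bad_fraction / (1 - slo)

def should_page(short_window_bad: float, long_window_bad: float,
                slo: float = 0.999, threshold: float = 2.0) -> bool:
    """Require both windows to exceed the threshold: the short window
    filters out stale conditions, the long window filters out blips."""
    return (burn_rate(short_window_bad, slo) >= threshold
            and burn_rate(long_window_bad, slo) >= threshold)

# 0.3% errors over 5 minutes and 0.25% over 1 hour, against a 99.9% SLO
print(should_page(0.003, 0.0025))  # True: burning at ~3x and ~2.5x
```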

Review SLOs in postmortems. After every significant incident, check whether the incident was captured by your SLOs. If it wasn't, either the SLO is measuring the wrong thing, or you need an additional SLI. If it was, look at how much budget burned and whether your current targets still make sense.

Make SLO status visible. Your on-call engineer should be able to see current error budget burn without opening three different dashboards. Whether that's a Grafana panel, a Slack digest, or a status page internal to your team, it needs to be somewhere they'll actually look.

Revisit SLOs quarterly. Services change. Traffic patterns change. User expectations change. An SLO you set 18 months ago might be completely irrelevant to what the service does now. Block time to review and update them.

Common ways SLOs go wrong

Too many SLOs. If you have 15 SLOs for a single service, nobody will track all of them. Pick the two or three that matter most. You can always add more later.

SLOs that no alert references. If none of your alerts mention an SLO, the SLOs aren't doing anything. They're just numbers.

Targets set by committee to please everyone. SLOs should reflect what's achievable and what matters, not what sounds impressive. "Five nines" is a fine goal for a payment processor's core transaction path. It's overkill for an internal reporting dashboard.

SLOs that never fail. If your error budget is always 100%, either the service is genuinely rock solid or your SLO is too loose. Check your incident history. If you've had real incidents that didn't touch the error budget, the SLO isn't measuring the right thing.


Getting SLOs right takes a few iterations. Start with one service, pick one or two signals that users actually care about, set a realistic target, and build from there. The goal isn't perfect SLO coverage across every service on day one. The goal is having at least one SLO that changes how your team responds to incidents this month.
