
How to Calculate SLA Uptime (And What the Numbers Actually Mean)
When someone says their service has "four nines" of availability, they mean 99.99% uptime. That sounds impressive until you realize that 99.99% still allows 52 minutes of downtime per year. Whether that's acceptable depends entirely on what your service does.
SLA uptime math is one of those things that's easy to get wrong. Teams set ambitious targets without understanding the constraints, or they sign SLAs without realizing what they're actually committing to.
The Basic Formula
Allowed downtime = (1 - uptime%) x time period
That's it. If your SLA promises 99.9% uptime over a month:
- A 30-day month has 43,200 minutes
- 0.1% of 43,200 = 43.2 minutes
So you can be down for about 43 minutes in a month before you breach your SLA. That's not a lot. A slow deploy or a bad config push can eat through that budget fast.
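The formula translates directly into code. A minimal sketch (the function name and the 30-day month constant are illustrative, not from any standard library):

```python
def allowed_downtime_minutes(uptime_pct: float, period_minutes: float) -> float:
    """Allowed downtime = (1 - uptime%) x time period."""
    return (1 - uptime_pct / 100) * period_minutes

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Three nines over a month: the ~43-minute budget from above
print(round(allowed_downtime_minutes(99.9, MINUTES_PER_MONTH), 1))  # 43.2
```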
What Each Level of Nines Actually Means
Here's the yearly breakdown:
| Uptime | Allowed Downtime/Year |
|---|---|
| 99% (two nines) | 3 days, 15 hours |
| 99.9% (three nines) | 8 hours, 45 minutes |
| 99.95% | 4 hours, 22 minutes |
| 99.99% (four nines) | 52 minutes |
| 99.999% (five nines) | 5 minutes |
The jump from three nines to four nines is significant. You go from having almost nine hours of buffer to less than an hour. Five nines means you can be down for five minutes total across an entire year. That's essentially zero tolerance for planned maintenance, bad deploys, or upstream provider outages.
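The whole table falls out of the same formula. A sketch that regenerates it, assuming a 365-day year and truncating to whole minutes:

```python
def yearly_downtime(uptime_pct: float) -> str:
    """Format the yearly downtime budget as hours and minutes."""
    minutes = (1 - uptime_pct / 100) * 365 * 24 * 60
    hours, mins = divmod(int(minutes), 60)  # truncate, then split into h/m
    return f"{hours}h {mins}m"

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {yearly_downtime(target)}")
# 99.0%  -> 87h 36m   (3 days, 15 hours)
# 99.99% -> 0h 52m
```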
To calculate the exact numbers for any SLA target, use our free SLA uptime calculator. It shows downtime by period and includes error budget calculations.
Picking the Right Target
Most teams should not aim for five nines. It's expensive, it constrains your ability to ship changes, and most users can't tell the difference between 99.99% and 99.999%.
A better approach: look at what your users actually need.
Internal tools and dashboards? 99.5% is probably fine. Nobody's going to notice 3.6 hours of downtime in a month if it happens during off-hours.
Customer-facing web apps? 99.9% to 99.95% is a reasonable target. It gives you enough room for weekly deploys and the occasional bad config.
Payment processing, healthcare, financial systems? 99.99% or higher. Downtime has direct revenue or safety implications.
The mistake is treating the uptime number as a prestige metric. Higher isn't always better. Higher means more engineering effort, more infrastructure redundancy, and slower iteration speed.
Error Budgets: The Practical Side of SLAs
An error budget flips SLA thinking on its head. Instead of asking "how much uptime do we need?" it asks "how much failure can we tolerate?"
If your SLO is 99.9% and you serve 10 million requests per month, your error budget is 10,000 failed requests. That's your budget for bad deploys, experiments, infrastructure changes, and actual bugs.
When you've consumed most of your error budget, you slow down and focus on reliability. When you have plenty of budget left, you can ship faster and take more risks. It's a built-in throttle for balancing reliability with velocity.
Our error budget calculator handles this math for you. Input your SLO and request volume, and it shows your remaining budget and burn rate.
SLAs vs. SLOs vs. SLIs
These terms get used interchangeably, but they mean different things.
SLI (Service Level Indicator) is the measurement. Latency, error rate, throughput. It's what you're actually measuring.
SLO (Service Level Objective) is the internal target. "99.9% of requests should complete in under 200ms." It's a goal your team works toward.
SLA (Service Level Agreement) is the external commitment. It's what you promise to customers, usually with financial consequences if you miss it. SLAs should always be less aggressive than your SLOs, because you want a buffer.
If your SLO is 99.95%, your SLA might promise 99.9%. That gives you a 0.05-percentage-point margin before you owe customers credits or refunds.
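That margin is easier to reason about in minutes than in percentage points. A quick sketch, assuming a 30-day month:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def monthly_downtime(uptime_pct: float) -> float:
    """Monthly downtime budget in minutes for a given uptime target."""
    return (1 - uptime_pct / 100) * MINUTES_PER_MONTH

slo, sla = 99.95, 99.9
buffer = monthly_downtime(sla) - monthly_downtime(slo)
print(f"{buffer:.1f} extra minutes before the SLA is breached")  # 21.6
```

In other words, missing the internal 99.95% SLO still leaves roughly 21 minutes of monthly downtime before the external 99.9% promise is broken.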
Measuring Uptime Correctly
How you measure uptime matters as much as the target itself.
Synthetic monitoring sends regular health checks to your service. If a check fails, you know your service is down. Simple, but it only catches full outages and misses degraded performance.
Real user monitoring tracks actual user experience. It catches things synthetic checks miss, like slow page loads, regional outages, or intermittent errors. But it's noisier and harder to turn into a clean uptime number.
Error rate monitoring counts failed requests as a percentage of total requests. This is the most practical approach for API-based services and maps directly to SLO math.
Most teams use a combination. Synthetic checks for availability, error rates for reliability, and latency percentiles for performance.
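For the error-rate approach, the uptime number is just the request success rate. A minimal sketch (the sample counts are made up for illustration):

```python
def measured_uptime(total_requests: int, failed_requests: int) -> float:
    """Uptime as the request success rate, in percent."""
    return 100 * (total_requests - failed_requests) / total_requests

# 4,200 failures out of 10 million requests in the window
print(f"{measured_uptime(10_000_000, 4_200):.3f}%")  # 99.958%
```

One design note: this maps directly onto SLO math, since the same failed-request count is also what you subtract from your error budget.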
What to Do When You Breach
SLA breaches happen. What matters is how you handle them.
First, acknowledge it. If your SLA includes credits or compensation, honor them without making customers fight for it. Trust is hard to rebuild once lost.
Second, figure out why. Was it a one-time event or a systemic issue? A single bad deploy is an incident. Repeated breaches over multiple months are a reliability problem that needs engineering investment.
Third, communicate. Customers can handle occasional downtime. What they can't handle is not knowing what happened or whether it'll happen again. A clear postmortem builds more trust than perfect uptime.
Start With the Math
Before you set SLA targets, do the math. Use our SLA uptime calculator to see what your target actually means in real downtime. Then decide if your infrastructure, processes, and team can actually deliver on that promise.
If you're building an incident response system to help you meet those targets, NearIRM handles the alerting, escalation, and on-call scheduling so your team can focus on keeping things running.