
MTTR, MTTA, and MTBF: What They Measure and Why They Matter
Incident response metrics get thrown around a lot in postmortems and engineering reviews. MTTR this, MTTA that. But most teams track these numbers without a clear idea of what they're optimizing for or whether the numbers are even accurate.
Here's what each metric actually measures, how to calculate them, and when they're useful.
MTTA: Mean Time to Acknowledge
MTTA measures how long it takes from when an alert fires to when someone acknowledges it. It's a proxy for "how quickly does a human start looking at this problem?"
Formula: Average of (acknowledge time - alert time) across all incidents
If your MTTA is 15 minutes, that means your team typically takes 15 minutes to see an alert and say "I'm on it." That's 15 minutes where nobody is actively investigating.
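The calculation is just an average of acknowledgment delays. A minimal sketch, using hypothetical alert and acknowledge timestamps:

```python
from datetime import datetime

# Hypothetical incident records: (alert fired, alert acknowledged).
incidents = [
    ("2024-03-01 09:00", "2024-03-01 09:04"),
    ("2024-03-02 14:30", "2024-03-02 14:52"),
    ("2024-03-03 03:15", "2024-03-03 03:24"),
]

def mtta_minutes(incidents):
    """Average of (acknowledge time - alert time), in minutes."""
    fmt = "%Y-%m-%d %H:%M"
    deltas = [
        (datetime.strptime(ack, fmt) - datetime.strptime(alert, fmt)).total_seconds() / 60
        for alert, ack in incidents
    ]
    return sum(deltas) / len(deltas)

print(f"MTTA: {mtta_minutes(incidents):.1f} minutes")  # 11.7 minutes
```

Note the 3am incident: a 9-minute acknowledgment at that hour usually means phone-call escalation is working.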
MTTA is the metric that most directly reflects your on-call effectiveness. A high MTTA usually means one of three things:
- Alert fatigue. Engineers are ignoring notifications because most of them are noise. They've been conditioned not to respond quickly. Fix the alert noise problem first.
- Bad notification routing. The alert isn't reaching the right person through the right channel. A Slack notification at 3am won't wake anyone up. Phone calls will.
- Gaps in on-call coverage. Nobody is actually on-call, or the person on-call doesn't have their phone nearby.
A good MTTA target for most teams is under 5 minutes for P1 incidents.
MTTR: Mean Time to Resolve
MTTR is the most commonly tracked incident metric. It measures the total time from when an incident is detected to when it's fully resolved.
Formula: Average of (resolution time - detection time) across all incidents
MTTR includes everything: the time to acknowledge, diagnose, fix, deploy, and verify. It's a broad metric that captures the full incident lifecycle.
The problem with MTTR is that it's too broad to be directly actionable. If your MTTR is 4 hours, is that because detection is slow? Because diagnosis takes too long? Because deploys are slow? MTTR doesn't tell you.
That's why it's useful to break MTTR into components:
- Time to detect: How long before anyone knows there's a problem?
- Time to acknowledge: How long before someone starts working on it?
- Time to diagnose: How long to find the root cause?
- Time to fix: How long to implement and deploy the fix?
- Time to verify: How long to confirm the fix worked?
When you break it down this way, you can find the bottleneck. Maybe your detection is fast but diagnosis is slow because your logging is poor. Or maybe everything is fast except deploys, which take 45 minutes.
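The breakdown above is easy to compute if you record a timestamp at each phase boundary. A sketch with a hypothetical single-incident timeline (the stage names are illustrative, not a standard schema):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical timeline for one incident: one timestamp per phase boundary.
incident = {
    "detected":     "2024-03-01 09:00",
    "acknowledged": "2024-03-01 09:05",
    "diagnosed":    "2024-03-01 10:10",
    "fixed":        "2024-03-01 10:40",
    "verified":     "2024-03-01 10:55",
}

def phase_durations(incident):
    """Minutes spent in each phase of the incident lifecycle."""
    stages = ["detected", "acknowledged", "diagnosed", "fixed", "verified"]
    times = [datetime.strptime(incident[s], FMT) for s in stages]
    labels = ["acknowledge", "diagnose", "fix", "verify"]
    return {
        label: (end - start).total_seconds() / 60
        for label, start, end in zip(labels, times, times[1:])
    }

durations = phase_durations(incident)
for phase, minutes in durations.items():
    print(f"time to {phase}: {minutes:.0f} min")
# The largest phase is the bottleneck; in this example, diagnosis (65 min).
```

Averaging each phase across many incidents, rather than eyeballing one, is what turns this into a real bottleneck analysis.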
MTBF: Mean Time Between Failures
MTBF measures how long your system runs between incidents. It's a reliability metric.
Formula: Average of (next incident start - previous incident end) across sequential incidents
If your MTBF is 7 days, you're having roughly one incident per week. If it's 30 days, you're having about one per month.
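Unlike MTTA and MTTR, MTBF is computed from gaps between consecutive incidents, so the incident log must be sorted chronologically first. A sketch with hypothetical data:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical incident log: (start, end), in chronological order.
incidents = [
    ("2024-03-01 09:00", "2024-03-01 11:00"),
    ("2024-03-09 14:00", "2024-03-09 15:30"),
    ("2024-03-15 08:00", "2024-03-15 09:00"),
]

def mtbf_days(incidents):
    """Average gap between the end of one incident and the start of the next."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(incidents, incidents[1:]):
        gap = datetime.strptime(next_start, FMT) - datetime.strptime(prev_end, FMT)
        gaps.append(gap.total_seconds() / 86400)  # seconds per day
    return sum(gaps) / len(gaps)

print(f"MTBF: {mtbf_days(incidents):.1f} days")  # 6.9 days
```

Three incidents yield only two gaps, which illustrates why MTBF needs a decent sample size before it means much.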
MTBF is useful for spotting trends. If your MTBF is shrinking over time, your system is becoming less reliable. Maybe you're shipping faster without adequate testing, or your infrastructure is hitting scaling limits.
A growing MTBF is a good sign. It means your reliability investments (better testing, improved monitoring, incident learnings) are paying off.
MTBF requires at least two incidents to calculate, and it's most meaningful with a larger dataset. Don't read too much into MTBF from three incidents.
Industry Benchmarks
DORA's State of DevOps research benchmarks teams on how quickly they recover from failures (its "failed deployment recovery time" metric, a close cousin of MTTR). As a rough banding:
| Performance Level | MTTR |
|---|---|
| Elite | Under 1 hour |
| High | 1 to 4 hours |
| Medium | 4 to 24 hours |
| Low | Over 24 hours |
These are useful as rough benchmarks, but don't obsess over them. A team with simple architecture and low traffic will have a naturally lower MTTR than a team running a complex distributed system. Context matters.
To see where your team falls, plug your incident data into our MTTR calculator. It calculates all three metrics and shows your benchmark level.
Which Metrics Should You Track?
Track all three, but focus on different ones depending on your current problems.
If incidents take too long to resolve: Focus on MTTR and break it into components. Find the bottleneck.
If alerts go unnoticed: Focus on MTTA. Fix your notification routing and on-call coverage.
If you're having too many incidents: Focus on MTBF. Invest in prevention: better testing, canary deploys, improved monitoring.
Don't track these metrics in a spreadsheet and forget about them. Review them monthly. Look for trends. Compare them across severity levels (your P1 MTTR should be much lower than your P4 MTTR).
Common Pitfalls
Averaging across severity levels. A 4-hour MTTR that includes both P1s (resolved in 30 minutes) and P4s (resolved in 3 days) is meaningless. Break metrics out by severity.
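To see how misleading a blended average is, consider a hypothetical dataset mixing fast P1s with multi-day P4s:

```python
from collections import defaultdict

# Hypothetical resolved incidents: (severity, resolution time in minutes).
incidents = [
    ("P1", 30), ("P1", 45),        # resolved in under an hour
    ("P4", 4320), ("P4", 2880),    # resolved in 2-3 days
]

def mttr_by_severity(incidents):
    """Mean resolution time per severity, instead of one blended average."""
    buckets = defaultdict(list)
    for severity, minutes in incidents:
        buckets[severity].append(minutes)
    return {sev: sum(vals) / len(vals) for sev, vals in buckets.items()}

blended = sum(m for _, m in incidents) / len(incidents)
print(f"blended MTTR: {blended:.0f} min")      # ~30 hours: describes neither group
print(mttr_by_severity(incidents))             # P1: 37.5 min, P4: 3600 min
```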
Including non-incidents. If you count alerts that turned out to be false positives, your metrics will be skewed. Only include actual incidents.
Optimizing the metric instead of the outcome. If teams feel pressured to close incidents quickly, they'll mark incidents as resolved before they're truly fixed. Track reopen rates alongside MTTR.
Not accounting for business hours. A P4 that was detected at 5pm Friday and resolved at 9am Monday has an MTTR of 64 hours. But nobody was working on it during the weekend because it wasn't urgent. Either exclude weekends for low-severity metrics or only track business-hours MTTR for P3/P4.
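One way to implement the business-hours variant is to count only minutes that fall inside working hours. A minimal sketch that reproduces the Friday-to-Monday example above, assuming a Monday-to-Friday, 9:00-17:00 workweek (the minute-by-minute loop favors clarity over speed):

```python
from datetime import datetime, timedelta

def business_minutes(start, end):
    """Minutes between start and end falling on weekdays, 9:00-17:00."""
    total = 0
    t = start
    while t < end:
        if t.weekday() < 5 and 9 <= t.hour < 17:  # weekday() < 5 means Mon-Fri
            total += 1
        t += timedelta(minutes=1)
    return total

FMT = "%Y-%m-%d %H:%M"
detected = datetime.strptime("2024-03-01 17:00", FMT)  # Friday 5pm
resolved = datetime.strptime("2024-03-04 09:00", FMT)  # Monday 9am

wall_clock_hours = (resolved - detected).total_seconds() / 3600
print(f"wall-clock MTTR: {wall_clock_hours:.0f} hours")  # 64 hours
print(f"business-hours MTTR: {business_minutes(detected, resolved) / 60:.0f} hours")  # 0 hours
```

Zero business hours elapsed, which matches the intuition: nobody was expected to be working on it.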
Start Measuring
If you're not tracking these metrics yet, start simple. Record detection, acknowledgment, and resolution timestamps for every incident. You can calculate the metrics from that data.
Our MTTR calculator does the math for you. Add your incident timestamps and it shows MTTA, MTTR, and MTBF with benchmark comparisons.
Once you have baseline numbers, set improvement targets. A 20% reduction in MTTR over a quarter is aggressive but achievable if you focus on the biggest bottleneck.
And if you want those timestamps captured automatically, NearIRM tracks detection, acknowledgment, and resolution times for every alert. Your metrics update in real time without anyone filling out a spreadsheet.