NearIRM Team

On-Call Metrics Beyond MTTR: What to Actually Track

MTTR gets all the attention in reliability conversations. Engineers learn the term, managers ask about it in postmortems, and dashboards display it front and center. But MTTR alone tells you almost nothing about the health of your on-call program.

If your team is burning out, missing alerts, or resolving the same incident every other week, MTTR won't surface that. You need a wider set of metrics.

Alert volume per responder

This is the most overlooked number. Track how many pages each on-call engineer receives per shift: per person, not the total for the team.

Teams often look at aggregate counts and miss that certain rotations are brutally overloaded while others are quiet. A responder getting 30 pages in a 12-hour shift has no meaningful recovery time between incidents, regardless of how fast they close each one.

A reasonable threshold: fewer than 5 actionable alerts per shift. More than that and you're running hot. Much more and you have a systemic alerting problem, an understaffed rotation, or both.
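As a rough illustration, here is a minimal Python sketch of the per-person count. The `pages` list, the responder names, and the shift IDs are made up for the example; in practice they would come from whatever export your paging tool provides.

```python
from collections import Counter

# Hypothetical page log: one (responder, shift_id) tuple per actionable page.
pages = (
    [("alice", "mon-night")] * 6
    + [("bob", "tue-day")] * 2
    + [("alice", "wed-day")] * 1
)

def pages_per_responder_shift(pages):
    """Count pages for each (responder, shift) pair, not for the team as a whole."""
    return Counter(pages)

def running_hot(counts, limit=5):
    """Return the (responder, shift) pairs with `limit` or more pages."""
    return {key: n for key, n in counts.items() if n >= limit}

counts = pages_per_responder_shift(pages)
hot = running_hot(counts)
print(hot)
```

The team total here is only 9 pages, which looks unremarkable until you see that 6 of them landed on one person in one shift.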

Alert-to-action ratio

For every alert that fires, did someone actually do something? Or did it resolve on its own?

This ratio tells you what fraction of your alerts require human intervention. If you're at 30% or lower, most of your pages are noise. Responders learn to discount alerts, which is exactly when they start missing the ones that matter.

A useful proxy: after an alert fires, look at whether the responder ran any runbook steps, escalated, or opened a ticket. If the answer is "no, I just waited and it cleared," that's a false positive, even if the system technically misbehaved for a moment.

Track this per alert policy, not just overall. You'll find specific monitors that fire constantly and almost never lead to action.
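One way to compute this per policy is sketched below. The policy names and the `action_taken` flag are assumptions for the example; how you decide whether a responder "took action" depends on what your tooling records.

```python
from collections import defaultdict

# Hypothetical records: (policy_name, action_taken) for each alert that fired.
alerts = [
    ("disk-usage", False),
    ("disk-usage", False),
    ("disk-usage", True),
    ("disk-usage", False),
    ("api-5xx", True),
    ("api-5xx", True),
]

def action_ratio_by_policy(alerts):
    """Fraction of alerts per policy that led to human action."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for policy, action_taken in alerts:
        fired[policy] += 1
        if action_taken:
            acted[policy] += 1
    return {policy: acted[policy] / fired[policy] for policy in fired}

ratios = action_ratio_by_policy(alerts)

# Policies at or below 30% are mostly noise by the rule of thumb above.
noisy = sorted(policy for policy, ratio in ratios.items() if ratio <= 0.30)
print(ratios, noisy)
```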

Repeat incident rate

Count how many alerts fire more than once for the same root cause within a rolling 30-day window. High repeat rates signal that your team is treating symptoms rather than fixing underlying problems.

A 20% repeat rate means one in five incidents is something your team has already seen. That's wasted response time and a strong signal that you need to either fix the root cause or automate the remediation.

Some teams set a policy: any alert that fires more than three times in 30 days gets a dedicated engineering task to resolve the root cause permanently. The metric makes that decision mechanical rather than optional.
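A sketch of both calculations follows, assuming each incident has already been tagged with a root-cause label (the hard part, and not something code can do for you). The labels and dates are invented for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)

def repeat_rate(incidents):
    """Fraction of incidents whose root cause was seen in the preceding 30 days.

    incidents: list of (root_cause, fired_at) tuples.
    """
    last_seen = {}
    repeats = 0
    for cause, fired_at in sorted(incidents, key=lambda item: item[1]):
        previous = last_seen.get(cause)
        if previous is not None and fired_at - previous <= WINDOW:
            repeats += 1
        last_seen[cause] = fired_at
    return repeats / len(incidents) if incidents else 0.0

def needs_root_cause_task(incidents, max_fires=3):
    """Root causes that fired more than max_fires times in any 30-day window."""
    by_cause = {}
    for cause, fired_at in incidents:
        by_cause.setdefault(cause, []).append(fired_at)
    flagged = set()
    for cause, times in by_cause.items():
        times.sort()
        start = 0
        for end, fired_at in enumerate(times):
            # Shrink the window until it spans at most 30 days.
            while fired_at - times[start] > WINDOW:
                start += 1
            if end - start + 1 > max_fires:
                flagged.add(cause)
                break
    return flagged

# Hypothetical data: one cause fires four times in three weeks, another twice, far apart.
base = datetime(2024, 1, 1)
incidents = [("db-conn-pool", base + timedelta(days=d)) for d in (1, 5, 10, 20)]
incidents += [("cert-expiry", base + timedelta(days=d)) for d in (1, 60)]

rate = repeat_rate(incidents)
flagged = needs_root_cause_task(incidents)
print(rate, flagged)
```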

Escalation rate

What percentage of incidents require escalating past the first responder?

A moderate escalation rate (10-20%) is healthy. It means your on-call engineers are appropriately deferring to specialists when they hit the edge of their knowledge. A consistently high rate (above 40%) usually means one of two things: responders don't have enough context to handle common incidents, or your runbooks are missing or outdated.

Escalation data also helps identify knowledge gaps across the team. If escalations consistently go to the same two engineers, you have a bottleneck and a single point of failure.
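Both numbers can come from the same data. The sketch below assumes each incident record carries either `None` (handled by the first responder) or the name of the person it was escalated to; the names are hypothetical.

```python
from collections import Counter

# Hypothetical: one entry per incident, None if the first responder handled it.
escalated_to = [None] * 6 + ["dana", "dana", "eli", "fay"]

def escalation_rate(escalated_to):
    """Fraction of incidents that went past the first responder."""
    if not escalated_to:
        return 0.0
    return sum(1 for target in escalated_to if target is not None) / len(escalated_to)

def top_target_share(escalated_to, top_n=2):
    """Share of all escalations absorbed by the top_n most frequent targets."""
    targets = Counter(target for target in escalated_to if target is not None)
    total = sum(targets.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in targets.most_common(top_n)) / total

rate = escalation_rate(escalated_to)
share = top_target_share(escalated_to)
print(rate, share)
```

A high `top_target_share` alongside a moderate escalation rate is the bottleneck pattern: the rate looks fine, but a couple of people carry most of it.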

Acknowledge time and resolve time, tracked separately

Most tooling reports these together or only shows MTTR. Track them separately.

Mean time to acknowledge (MTTA) measures how quickly someone picks up an alert. A high MTTA on critical incidents (above 5 minutes) suggests your notification channels aren't working, responders are unavailable, or the alert priority is misconfigured.

Mean time to resolve (MTTR) measures how long it takes to fix the problem. High MTTR might mean the incidents are genuinely complex, or it might mean responders are working without good runbooks and diagnostic tools.

If MTTA is low but MTTR is high, your team responds fast but struggles to diagnose. That's a knowledge or tooling problem.

If MTTA is high but MTTR is low, responders fix things quickly once they start but are slow to pick up alerts. That's a notification or availability problem.

The distinction matters because the fix is different.
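Here is a minimal sketch of tracking the two separately. It makes two assumptions worth calling out: resolve time is measured from acknowledgement rather than from when the alert fired, so the two numbers don't overlap, and the 60-minute resolve limit is a placeholder, not a recommendation. The 5-minute acknowledge limit is the one mentioned above.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(delta.total_seconds() for delta in deltas) / 60

def mtta_and_mttr(incidents):
    """incidents: list of (fired_at, acked_at, resolved_at) datetimes."""
    mtta = mean_minutes([acked - fired for fired, acked, _ in incidents])
    # Assumption: resolve time counts from acknowledgement, not from firing.
    mttr = mean_minutes([resolved - acked for _, acked, resolved in incidents])
    return mtta, mttr

def diagnose(mtta, mttr, mtta_limit=5, mttr_limit=60):
    """Map the two numbers to the kind of problem they point at."""
    slow_ack = mtta > mtta_limit
    slow_fix = mttr > mttr_limit  # mttr_limit is a hypothetical placeholder
    if slow_ack and slow_fix:
        return "both"
    if slow_fix:
        return "knowledge or tooling problem"
    if slow_ack:
        return "notification or availability problem"
    return "ok"

fired = datetime(2024, 1, 1, 12, 0)
incidents = [
    (fired, fired + timedelta(minutes=2), fired + timedelta(minutes=92)),
    (fired, fired + timedelta(minutes=4), fired + timedelta(minutes=154)),
]
mtta, mttr = mtta_and_mttr(incidents)
print(mtta, mttr, diagnose(mtta, mttr))
```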

Off-hours page rate

Track what percentage of pages hit responders outside of business hours. This is a proxy for the human cost of your alerting strategy.

Off-hours pages that resolve on their own are especially damaging. They pull someone out of sleep for no reason, and if they happen consistently, you'll see burnout even on teams with low total page counts.

If more than 40-50% of your pages arrive off-hours and most of them don't require action, either your thresholds are misconfigured or you're alerting on things that could wait until morning.
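A sketch of the calculation, assuming business hours are Monday to Friday, 09:00 to 18:00, and that timestamps are already in the responder's local time. Both are assumptions you would adjust for your team.

```python
from datetime import datetime

def is_off_hours(ts, start_hour=9, end_hour=18):
    """True on weekends or outside start_hour..end_hour (responder-local time)."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (start_hour <= ts.hour < end_hour)

def off_hours_stats(pages):
    """pages: list of (local_timestamp, action_taken) tuples.

    Returns (off-hours share of all pages, share of off-hours pages with no action).
    """
    off = [(ts, acted) for ts, acted in pages if is_off_hours(ts)]
    rate = len(off) / len(pages) if pages else 0.0
    no_action = sum(1 for _, acted in off if not acted)
    no_action_share = no_action / len(off) if off else 0.0
    return rate, no_action_share

# Hypothetical pages: 2024-05-06 is a Monday, 2024-05-04 a Saturday.
pages = [
    (datetime(2024, 5, 6, 10, 0), True),
    (datetime(2024, 5, 6, 3, 0), False),
    (datetime(2024, 5, 4, 14, 0), False),
    (datetime(2024, 5, 6, 23, 0), True),
]
rate, no_action_share = off_hours_stats(pages)
print(rate, no_action_share)
```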

A reference table

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Pages per responder per shift | < 5 | 5-15 | > 15 |
| Alert-to-action ratio | > 70% | 40-70% | < 40% |
| Repeat incident rate (30-day) | < 15% | 15-30% | > 30% |
| Escalation rate | 10-20% | 20-40% | > 40% |
| MTTA (critical incidents) | < 3 min | 3-10 min | > 10 min |
| Off-hours page rate | < 30% | 30-50% | > 50% |

These thresholds aren't universal. A team running a low-criticality service can tolerate higher escalation rates. A team running payment infrastructure should aim for MTTA under 1 minute. Use the ranges as starting points and adjust based on what your specific system demands.
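If you want to apply the ranges mechanically, they could be encoded along these lines. This is a sketch: the metric keys are made-up names, values on a boundary are treated as "warning", and an escalation rate below 10% is treated as healthy even though the table doesn't cover that case.

```python
# (healthy_limit, critical_limit, higher_is_better), taken from the table above.
# Percentages are expressed as fractions; the key names are hypothetical.
THRESHOLDS = {
    "pages_per_shift": (5, 15, False),
    "action_ratio": (0.70, 0.40, True),
    "repeat_rate": (0.15, 0.30, False),
    "escalation_rate": (0.20, 0.40, False),  # below 10% is not covered by the table
    "mtta_minutes": (3, 10, False),
    "off_hours_rate": (0.30, 0.50, False),
}

def classify(metric, value):
    """Return 'healthy', 'warning', or 'critical' for a metric value."""
    healthy, critical, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > healthy:
            return "healthy"
        if value < critical:
            return "critical"
        return "warning"
    if value < healthy:
        return "healthy"
    if value > critical:
        return "critical"
    return "warning"

print(classify("pages_per_shift", 18))
print(classify("action_ratio", 0.55))
```

Keeping the limits in one dictionary makes it easy to override them per team, which matters given how much the right numbers vary by service.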

What to do with these numbers

Metrics without action are just dashboards.

Run a monthly on-call review where you look at each of these numbers for the prior period. Don't just report them: identify the specific alert policies or services driving the outliers.

For repeat incidents, assign ownership. The team responsible for a service should be required to drive repeat fires below a threshold before shipping new features, or you're building on an unstable foundation.

For low alert-to-action ratios, delete or silence the noisy monitors. This feels risky but isn't. If an alert consistently fires without leading to action, you're training responders to ignore it. Better to remove it entirely and build a better signal from scratch.

For high escalation rates, run runbook audits. Look at the incidents that escalated and check whether documentation existed and whether it was useful. Most teams find that runbooks either don't exist for common incident types, or they exist but are too outdated to trust.

Track per team, not globally

Global on-call metrics hide a lot. A healthy average across your organization can mask one team that's completely underwater.

Slice these metrics by team, rotation, and service. The teams with the worst numbers usually aren't the ones who say something. They're the ones quietly handling pages every night and not complaining because it feels like the normal state of things.

Making the numbers visible is often enough to start a conversation. When an engineering manager can see that their responders are averaging 18 pages per shift, it stops being an abstract "we're busy" complaint and becomes something concrete they can act on.
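To see how an average hides an overloaded team, consider this small sketch. The team names and page counts are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (team, pages_in_shift) for each on-call shift in the period.
shifts = (
    [("payments", n) for n in (1, 2, 1, 1, 2, 1)]
    + [("search", n) for n in (2, 1, 1, 2, 1, 1)]
    + [("platform", n) for n in (17, 19)]
)

def mean_pages_by_team(shifts):
    """Average pages per shift, sliced by team."""
    per_team = defaultdict(list)
    for team, pages in shifts:
        per_team[team].append(pages)
    return {team: mean(values) for team, values in per_team.items()}

global_mean = mean(pages for _, pages in shifts)
by_team = mean_pages_by_team(shifts)
print(round(global_mean, 2), by_team)
```

The organization-wide average comes out under 4 pages per shift, while the platform team is averaging 18. The same grouping works for rotation or service in place of team.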
