NearIRM Team
5 min read

Designing Escalation Policies That Actually Work

Escalation policies are supposed to ensure that if one person misses an alert, someone else gets it. In practice, they often do one of two things: wake up half your team for something minor, or fail to reach anyone for something critical.

The problem isn't the concept of escalation. The problem is that most teams design escalation policies without thinking through what they're trying to accomplish. Here's how to design policies that actually work.

Start With Timeout Durations

The most important decision in an escalation policy is how long to wait before escalating. Too short, and you escalate before the first responder even has a chance to acknowledge the alert. Too long, and an urgent issue sits unattended while everyone assumes someone else is handling it.

A good default for the first timeout is 5 minutes. That gives the on-call engineer enough time to see the alert, assess the situation, and acknowledge it if they're working on it. If they don't respond in 5 minutes, escalate.

For the second tier, 10 minutes works well. By this point, you're escalating to someone more senior or a team lead, and you want to give the first responder enough time to either resolve the issue or confirm they need help before pulling in more people.

These are defaults, not rules. If your team is distributed across time zones or if your alerts are consistently high urgency, you might want shorter timeouts. If your alerts are often false positives, longer timeouts might make sense.
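To see what these defaults mean end to end, it can help to add the timeouts up. The sketch below is illustrative Python, not tied to any particular alerting tool. `page_schedule` is a hypothetical helper, and it assumes each timeout starts counting when the previous tier is paged.

```python
from itertools import accumulate


def page_schedule(timeouts_min):
    """Return the minutes after the alert fires at which each tier is paged.

    timeouts_min[i] is how long tier i+1 has to acknowledge before the
    next tier is paged (assumption: timeouts run back to back).
    Tier 1 is always paged at minute 0.
    """
    return [0, *accumulate(timeouts_min)]


# Defaults from the text: 5 minutes for tier 1, 10 minutes for tier 2.
for tier, minute in enumerate(page_schedule([5, 10]), start=1):
    print(f"Tier {tier} paged at +{minute} min")
```

With these defaults, a third tier would not hear about an unacknowledged alert until 15 minutes after it fired, which is worth keeping in mind when deciding whether the defaults fit a given alert.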

Who Goes at Each Tier

The most common mistake in escalation design is putting people at the wrong tier. Here's a reasonable structure:

Tier 1: The on-call engineer. This is the person whose job it is to respond to alerts. They should have enough context and access to triage and resolve most issues.

Tier 2: The backup or team lead. This is someone who can help if the primary on-call engineer is unreachable or if the issue is complex. They should have deeper knowledge of the systems and the authority to make judgment calls.

Tier 3: Senior engineers or a broader group. This tier is for "we need more hands" situations. It might be a second team, a senior engineer with specialized knowledge, or a wider group of responders.

Do not put managers or executives in your escalation policy unless they are hands-on and capable of actually fixing issues. Escalating to someone who will just ping an engineer is pointless. Cut out the middleman.
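One way to keep the "no middlemen" rule honest is to record, for each person in the policy, whether they can actually work an incident. The following is a minimal sketch in Python; the `Responder` type, the `hands_on` flag, and every name in the roster are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Responder:
    name: str
    hands_on: bool  # can this person triage and fix issues themselves?


# Hypothetical roster following the three-tier structure above.
TIERS = {
    1: [Responder("primary-on-call", True)],
    2: [Responder("team-lead", True)],
    3: [Responder("senior-db-engineer", True),
        Responder("vp-engineering", False)],
}


def middlemen(tiers):
    """Return (tier, name) for everyone who would only relay the page."""
    return [
        (tier, person.name)
        for tier, members in sorted(tiers.items())
        for person in members
        if not person.hands_on
    ]


for tier, name in middlemen(TIERS):
    print(f"Tier {tier}: consider removing {name}")
```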

When to Skip Tiers

Not every alert needs to go through every tier. For critical alerts, you might want to skip straight to Tier 2 or even notify multiple tiers simultaneously.

For example, if your payment processing system goes down, you probably don't want to wait even 5 minutes for the first escalation. You want the on-call engineer and their backup to both get alerted immediately.

Most alerting systems let you configure different escalation policies for different alert severities. Use this feature. Create a "standard" policy for routine alerts and a "critical" policy that escalates faster or pages more people.

Handling Severity Levels

If your monitoring system supports it, tie escalation behavior to alert severity. Here's a simple model:

SEV-1 (Critical): Production is down or severely degraded. Page Tier 1 and Tier 2 at the same time. If neither acknowledges within 5 minutes, escalate to Tier 3.

SEV-2 (High): Significant impact but not a total outage. Normal escalation policy. Tier 1 gets 5 minutes, Tier 2 gets 10 minutes.

SEV-3 (Medium): Minor degradation or potential issue. Longer timeouts. Tier 1 gets 10 minutes, Tier 2 gets 15 minutes.

SEV-4 (Low): Informational or non-urgent. No escalation, just notify the on-call engineer.

You can adjust these severities and timeouts based on your team's needs, but the key is that higher-severity alerts should escalate faster and reach more people.
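The severity model above can be written down as data, which makes it easy to check who would have been paged at any point. This is a rough Python sketch rather than any tool's real configuration format. The `POLICIES` layout and the `tiers_paged` helper are hypothetical, and it assumes that Tier 3 is paged once Tier 2's timeout expires for SEV-2 and SEV-3, which the model implies but does not state.

```python
# Each step is (minutes to wait after the previous step, tiers paged).
POLICIES = {
    "SEV-1": [(0, (1, 2)), (5, (3,))],            # tiers 1 and 2 together
    "SEV-2": [(0, (1,)), (5, (2,)), (10, (3,))],  # normal escalation
    "SEV-3": [(0, (1,)), (10, (2,)), (15, (3,))], # longer timeouts
    "SEV-4": [(0, (1,))],                         # notify only, no escalation
}


def tiers_paged(severity, minutes_unacked):
    """Return the tiers paged so far if nobody has acknowledged the alert."""
    paged = []
    elapsed = 0
    for wait, tiers in POLICIES[severity]:
        elapsed += wait
        if elapsed > minutes_unacked:
            break
        paged.extend(tiers)
    return paged


for severity in POLICIES:
    print(severity, "after 6 unacknowledged minutes:", tiers_paged(severity, 6))
```

Six minutes into an unacknowledged alert, this sketch has all three tiers paged for SEV-1, two tiers for SEV-2, and only the on-call engineer for SEV-3 and SEV-4, matching the principle that higher severities escalate faster and reach more people.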

Too Many Tiers Is a Red Flag

If your escalation policy has more than three tiers, something is wrong. Either your on-call coverage is fragmented, your team is understaffed, or you're escalating to people who can't actually help.

More tiers don't make your system more reliable. They just create confusion about who is responsible and add delays before the right person gets involved.

If you find yourself building a four- or five-tier escalation policy, step back and ask why. Often the real issue is that your primary on-call engineer doesn't have the access or knowledge to handle most alerts. Fix that problem instead of building a convoluted escalation chain.
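If policies live in code or config, the three-tier guideline can be enforced with a small check. The sketch below is illustrative Python; `tier_count_warning` and the example policies are hypothetical, and the threshold simply mirrors the rule of thumb in this section.

```python
MAX_TIERS = 3  # rule of thumb from this section, not a hard limit


def tier_count_warning(policy_name, tiers):
    """Return a warning string if the policy has too many tiers, else None."""
    if len(tiers) > MAX_TIERS:
        return (f"{policy_name}: {len(tiers)} tiers; check whether tier 1 "
                "has the access and knowledge to handle most alerts")
    return None


# Hypothetical policies: each inner list is one tier's responders.
standard = [["primary-on-call"], ["team-lead"], ["senior-engineers"]]
sprawling = standard + [["platform-team"], ["director"]]

for name, tiers in [("standard", standard), ("sprawling", sprawling)]:
    print(tier_count_warning(name, tiers) or f"{name}: OK")
```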

Test Your Escalation Policies

The only way to know if your escalation policies work is to test them. Run escalation drills where you trigger a test alert and see if it reaches the right people at the right times.

Testing reveals problems like misconfigured contact methods, people who left the team but are still in the rotation, or timeout durations that don't make sense in practice.

Do these tests during business hours first, and make sure everyone knows it's a drill. Once you're confident the policies work, you can run unannounced tests to see how the team responds under realistic conditions.
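Some of the problems a drill uncovers can also be caught beforehand with a static check of the roster. The following Python sketch assumes you can export three things from your tooling: the tiers, the current staff list, and each person's contact methods. The `audit_policy` function and all sample data are hypothetical, and a check like this complements a drill rather than replacing it.

```python
def audit_policy(tiers, active_staff, contact_methods):
    """List roster problems a drill would otherwise surface the hard way.

    tiers: list of lists of responder names, one inner list per tier.
    active_staff: set of names currently on the team.
    contact_methods: dict mapping name -> list of configured methods.
    """
    problems = []
    for tier_number, members in enumerate(tiers, start=1):
        for person in members:
            if person not in active_staff:
                problems.append(
                    f"tier {tier_number}: {person} is no longer on the team")
            elif not contact_methods.get(person):
                problems.append(
                    f"tier {tier_number}: {person} has no contact method")
    return problems


# Hypothetical data for illustration.
tiers = [["alice"], ["bob", "carol"]]
active_staff = {"alice", "carol"}
contact_methods = {"alice": ["sms", "push"], "carol": []}

for problem in audit_policy(tiers, active_staff, contact_methods):
    print(problem)
```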

Policies Should Match Reality

Your escalation policy should reflect how your team actually operates. If nobody checks their phone between 11pm and 7am, don't configure escalation timeouts as if someone will respond in 5 minutes at 2am. Adjust for reality.
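As a rough illustration of adjusting for reality, the first timeout can depend on the responder's local time. This Python sketch uses the 11pm to 7am window from the example above; the 15-minute overnight value and the `first_timeout_minutes` helper are assumptions chosen for illustration, not recommendations.

```python
from datetime import time

QUIET_START = time(23, 0)  # 11pm, from the example above
QUIET_END = time(7, 0)     # 7am


def first_timeout_minutes(local_time, daytime=5, overnight=15):
    """Pick the tier 1 timeout based on the responder's local time.

    The overnight default of 15 minutes is an assumed value; use
    whatever matches how quickly your team really responds at night.
    """
    # The quiet window wraps past midnight, so it is an "or", not an "and".
    in_quiet_hours = local_time >= QUIET_START or local_time < QUIET_END
    return overnight if in_quiet_hours else daytime


print(first_timeout_minutes(time(2, 0)))   # overnight timeout applies
print(first_timeout_minutes(time(14, 0)))  # daytime timeout applies
```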

Similarly, if your team is small and everyone pitches in during incidents anyway, you don't need a complex escalation policy. A simple two-tier structure is fine.

The goal is to get the right person involved at the right time. If your escalation policy does that consistently, it's working. If it doesn't, fix it.

At NearIRM, we make escalation policies easy to configure and test. You can set up different policies for different alert types, adjust timeouts on the fly, and see exactly who will be notified at each stage. Because escalation should be straightforward, not a puzzle.
