
# How to Define Incident Severity Levels That Actually Work
Most teams adopt severity levels early, label them P1 through P4, and then spend the next two years arguing about whether an outage was a P1 or a P2. The tiers exist on paper, but nobody agrees on what they mean.
This happens because severity levels are often copied from another company's blog post without adapting them to the actual business. The result is a system where everything feels urgent, escalation is inconsistent, and the on-call engineer is left making judgment calls at 2 a.m. with no real guidance.
Here's how to build severity tiers that are unambiguous enough to use under pressure.
## Why Severity Levels Matter
Severity levels answer two questions at the moment an incident starts: who needs to know, and how fast?
Without them, every alert gets treated as equally urgent. That's exhausting and counterproductive. A failed background job that retries automatically is not the same thing as a payment processor being down during a sales peak. Treating them the same burns out your team and trains them to ignore escalations.
Good severity levels:
- Set a response time expectation for the on-call engineer
- Determine which stakeholders get notified and when
- Guide the decision about whether to page additional people
- Give you clean data for retrospectives (how many P1s last quarter? are P2s trending up?)
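That retrospective data only materializes if severity is a structured field in your incident log rather than free text. A minimal Python sketch, assuming a hypothetical log export of (severity, quarter) pairs:

```python
from collections import Counter

# Hypothetical incident-log export: (severity, quarter) pairs.
incidents = [
    ("P1", "2024-Q1"), ("P2", "2024-Q1"), ("P2", "2024-Q1"),
    ("P1", "2024-Q2"), ("P2", "2024-Q2"), ("P3", "2024-Q2"),
]

def count_by_severity(log, severity):
    """How many incidents of a given severity occurred in each quarter."""
    return Counter(quarter for sev, quarter in log if sev == severity)
```

With this in place, "how many P1s last quarter?" is a query, not an argument.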
## The Four-Tier Model
Most teams do fine with four levels. Five or more creates too much debate at classification time. Two or three gives you too little resolution to route responses appropriately.
| Severity | Common Name | Response Target | Who Gets Paged |
|---|---|---|---|
| P1 | Critical | 5 minutes | On-call + team lead + stakeholders |
| P2 | Major | 30 minutes | On-call engineer |
| P3 | Minor | Next business day | Ticket queue |
| P4 | Informational | No response required | No one (logged only) |
These numbers are starting points. Your actual SLAs and team size will shape the final values.
## Defining Each Level
The most common mistake is writing severity descriptions that still require judgment to apply. "Significant user impact" is not a definition. It's a feeling.
Good definitions use observable criteria: are payments processing? is login working? what percentage of requests are failing? If the on-call engineer can look at a dashboard and answer the question in under 30 seconds, the definition is good enough.
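As an illustration, a definition built from observable criteria can be a plain boolean check over dashboard numbers. This sketches a P1 test; the metric names and thresholds are hypothetical:

```python
def meets_p1_criteria(checkout_error_rate: float,
                      app_5xx_rate: float,
                      login_available: bool) -> bool:
    """Answerable from a dashboard in seconds; no judgment required.
    Thresholds are illustrative, not a standard."""
    return (
        checkout_error_rate > 0.01  # payments failing for >1% of requests
        or app_5xx_rate > 0.50      # most of the application returning 5xx
        or not login_available      # authentication down for all users
    )
```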
### P1 (Critical)
P1 means the business is losing money right now, or users cannot perform the core action your product exists to support.
Examples:
- Checkout or payment flow is returning errors for more than 1% of requests
- The entire application is returning 5xx responses
- Authentication is failing for all users
- Data loss is occurring or confirmed
P1 triggers an immediate page, wakes up the team lead, and usually requires a status page update within 15 minutes. The on-call engineer does not wait to diagnose before escalating.
### P2 (Major)
P2 means a significant feature is degraded or unavailable, but the core product still works. Users are affected but can often work around it.
Examples:
- Email notifications are not delivering
- The mobile app is crashing for a specific OS version
- A dashboard or reporting feature is returning stale data
- API response times are elevated (but requests are succeeding)
P2 gets the on-call engineer working on it within 30 minutes. No stakeholder page until you have an assessment. If the issue persists past a threshold (say, 90 minutes with no resolution path), it escalates to P1 treatment.
### P3 (Minor)
P3 is a real bug, but it's not urgent. A user might notice it, file a support ticket, and be mildly frustrated. It doesn't block their workflow.
Examples:
- A form validation error message is wrong
- A chart renders incorrectly in a specific browser
- A non-critical API endpoint returns a 404 intermittently
P3s go into the ticket backlog and get triaged during business hours. Nobody gets paged. The on-call engineer documents it and moves on.
### P4 (Informational)
P4 exists to capture signals that aren't problems yet but might become one. Think of it as your noise floor.
Examples:
- Disk usage crossed 70% (threshold, not crisis)
- A scheduled job took longer than usual but completed
- A third-party API returned a non-fatal warning
P4 alerts should never page anyone. They're logged, and a human reviews them periodically (weekly or as part of a reliability review). If a P4 alert keeps firing without action, it should either get a proper threshold or be removed.
If your P4 bucket is empty, you probably don't have a P4 tier; you have alerts that you're ignoring. Name what you're ignoring. It's safer.
## Making Classification Fast
The goal is to classify an incident in under 60 seconds. A decision tree helps more than a prose description.
A simple flow:
- Is a core user action (login, checkout, core API) broken? P1
- Is a significant feature unavailable or degraded for a measurable group of users? P2
- Is something wrong but users can still do their job? P3
- Is this a threshold crossed or an anomaly without user impact? P4
Print this on a one-pager and add it to your runbook. On-call engineers shouldn't need to find a wiki page at 3 a.m. to figure out if they should wake someone up.
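The same flow can also live in your tooling as a tiny function, where each argument is the yes/no answer to one question in the tree. A sketch, not a substitute for the one-pager:

```python
def classify(core_action_broken: bool,
             feature_degraded: bool,
             user_visible_bug: bool) -> str:
    """Walk the decision tree top-down; the first 'yes' wins."""
    if core_action_broken:   # login, checkout, core API down
        return "P1"
    if feature_degraded:     # significant feature, measurable user group
        return "P2"
    if user_visible_bug:     # wrong, but users can still do their job
        return "P3"
    return "P4"              # threshold crossed, no user impact
```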
## Common Pitfalls
P1 inflation. Teams under pressure start calling everything P1 so they feel taken seriously. Then P1 loses meaning. Track the count of P1s per month. If it's more than two or three, either your system is genuinely fragile or your thresholds are wrong.
No auto-escalation rules. If your P2 has no teeth (i.e., nothing happens if it's ignored for two hours), it's not really a P2. Build escalation timeouts into your alerting tool so that an unacknowledged P2 becomes a P1 after a defined period.
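If your alerting tool exposes timers, the escalation rule itself is simple. A sketch of the promotion logic, assuming a two-hour window (the window, names, and signature are illustrative):

```python
from datetime import datetime, timedelta

P2_ESCALATION_WINDOW = timedelta(hours=2)  # illustrative threshold

def effective_severity(severity: str, opened_at: datetime,
                       acknowledged: bool, now: datetime) -> str:
    """An unacknowledged P2 gets P1 treatment once the window elapses."""
    if (severity == "P2" and not acknowledged
            and now - opened_at >= P2_ESCALATION_WINDOW):
        return "P1"
    return severity
```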
Severity set at alert creation, never revisited. The first responder should be able to adjust severity as the incident develops. What looks like a P3 at first might reveal itself as a P1 once you see the blast radius. Your incident tool should treat severity as mutable.
Different teams using different scales. Engineering calls it P1, the security team calls it SEV1, the account managers call it "critical." Pick one vocabulary and enforce it across the org. Mismatched terminology causes delays in cross-team incidents.
## Connecting Severity to Your On-Call Tool
Once you've defined your tiers, encode them into your alerting configuration. In NearIRM, you can set per-severity escalation policies, notification channels, and response-time SLAs directly on each alert source. A P1 from your payment gateway routes differently than a P3 from your analytics pipeline without any manual routing decisions.
The goal is a system where the on-call engineer sees an incoming alert, sees the severity, and already knows what's expected of them. No judgment call, no ambiguity, no second-guessing at 2 a.m.
That's when severity levels actually work.