NearIRM Team · 7 min read

The First 10 Minutes of an Incident: A Triage Checklist

You get paged. Your phone buzzes, the alert fires, and you have that brief moment of dread before you start figuring out what's actually wrong. What you do in the next ten minutes sets the tone for the entire incident.

This isn't about heroics. It's about having a mental model so you don't waste time on the wrong thing when stakes are high and your brain is half-asleep.

The goal isn't to fix it yet

The biggest mistake engineers make in the first few minutes is jumping straight to a fix. You notice the database CPU is at 100%, so you restart the service. You see a spike in errors, so you roll back the last deploy. Sometimes that works. Often it wastes time, hides symptoms, or makes things worse.

The goal of triage is to understand what's happening and how bad it is, not to immediately resolve it. Give yourself two to three minutes to assess before you touch anything.

Step 1: Acknowledge the alert (immediately)

Acknowledge before you do anything else. This tells your on-call tool you have it, stops escalation to the next responder, and starts your incident timer. It costs nothing and prevents your team from waking up unnecessarily.

If your alerting system doesn't auto-acknowledge on open, do it manually right away.
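If your on-call tool exposes an API, you can even script the acknowledgment so it happens the moment you pick up the page. A minimal sketch against a hypothetical endpoint and token; your provider's actual API will look different:

import os
import requests

# Hypothetical on-call tool API; substitute your provider's real endpoint and payload.
ONCALL_API = "https://oncall.example.com/api/v1/alerts"
TOKEN = os.environ["ONCALL_API_TOKEN"]

def acknowledge(alert_id: str) -> None:
    # Mark the alert acknowledged so escalation to the next responder stops.
    resp = requests.post(
        f"{ONCALL_API}/{alert_id}/ack",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()

acknowledge("alert-12345")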

Step 2: Check for customer-facing impact

Within the first minute, answer one question: is this affecting customers right now?

Check your key signals:

  • Error rate on your primary user-facing APIs
  • Response times at the 95th or 99th percentile
  • Checkout/login/signup success rates if you have them
  • Any synthetic monitors that test critical user journeys

If the answer is yes, that changes everything. You're in an active customer-impacting incident. Skip further investigation and go to severity declaration now. You can diagnose while you communicate, but you shouldn't delay communication to get a complete picture.

If nothing customer-facing is broken yet, you have a little more room to investigate before pulling in more people.
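If your metrics live in Prometheus or something query-compatible, two quick queries cover the first two signals on that list. A sketch, assuming hypothetical metric names and a job label of "api"; swap in whatever your services actually export:

import requests

PROM = "https://prometheus.example.com"  # assumption: your Prometheus endpoint

QUERIES = {
    # Share of user-facing requests returning 5xx over the last 5 minutes
    "error_rate": 'sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))'
                  ' / sum(rate(http_requests_total{job="api"}[5m]))',
    # 99th percentile latency, assuming a histogram metric
    "p99_latency_s": 'histogram_quantile(0.99,'
                     ' sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))',
}

for name, query in QUERIES.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name}: {value:.4f}")

A lopsided result, such as a high error rate with normal latency, already narrows where to look next.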

Step 3: Declare a severity level

Pick one and move on. Don't wait for perfect information.

A rough guide that works for most teams:

  • P1 / SEV1: Complete outage or critical feature down for all or many users
  • P2 / SEV2: Significant degradation affecting a large percentage of users
  • P3 / SEV3: Partial degradation, workaround available, or small percentage of users affected
  • P4 / SEV4: Minor issue, no real user impact, can wait for business hours

If you're unsure, round up. It's much easier to downgrade a severity than to explain why you sat on a P1 for 20 minutes thinking it was a P3.

Step 4: Open an incident channel

For anything P2 or above, open a dedicated Slack channel (or Teams, or whatever your team uses) right now. Name it something consistent, like #inc-2026-03-27-api-errors. Don't use the general engineering channel.

Post immediately with what you know:

🚨 Incident declared: elevated error rate on /api/checkout
Severity: P2
Started: ~02:14 UTC
Impact: ~15% of checkout requests returning 500s
Status: Investigating
IC: @yourname

Even if you don't know the cause, this gives anyone who joins later a starting point. It also starts your paper trail, which you'll want for the postmortem.
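Some teams wrap this first post in a small script so it's one command instead of typing under stress. A sketch using a Slack incoming webhook; the webhook URL and message fields are stand-ins for your own:

from datetime import datetime, timezone
import requests

# Assumption: an incoming webhook configured for your incident channel.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_incident_declaration(summary: str, severity: str, impact: str, ic: str) -> None:
    started = datetime.now(timezone.utc).strftime("%H:%M UTC")
    text = (
        f":rotating_light: Incident declared: {summary}\n"
        f"Severity: {severity}\n"
        f"Started: ~{started}\n"
        f"Impact: {impact}\n"
        f"Status: Investigating\n"
        f"IC: {ic}"
    )
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=5).raise_for_status()

post_incident_declaration(
    summary="elevated error rate on /api/checkout",
    severity="P2",
    impact="~15% of checkout requests returning 500s",
    ic="@yourname",
)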

Step 5: Loop in the right people

Now, not later. Waking someone up feels uncomfortable, but it's better than a 30-minute solo debugging session that could have been a 5-minute fix with the right person in the room.

For P1: page your incident commander, the relevant service owner, and anyone who touched the system in the last 24 hours.

For P2: the service owner and possibly a second engineer if you're not making progress within 5-10 minutes.

For P3: optional, but document what you're doing so the team knows in the morning.

Assign an incident commander if it's P1 or P2. This should be someone separate from whoever is doing the technical investigation. The IC's job is communication and coordination, not debugging.

Step 6: Gather context without going too deep

You need enough context to direct the investigation, not a full root cause analysis. The questions to answer quickly:

When did this start? Check your monitoring for the exact time the signal changed. Compare that to your deployment history. Did anything ship in the 30-60 minutes before the alert fired?

What changed recently? Deploys, config changes, database migrations, infrastructure changes, cron jobs that ran, traffic spikes. Most incidents are caused by something that changed.

What's the blast radius? Is it one service or many? One region or global? One customer type or all users?

Are there any obvious error messages? Scan the logs quickly for exceptions, connection timeouts, or OOM errors. You're looking for a signal, not reading every line.
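One quick way to scan without reading every line is to count error signatures. A sketch against a local log file; the path and patterns are assumptions, so adjust them to your logging setup:

import re
from collections import Counter

LOG_PATH = "/var/log/app/api.log"  # assumption: wherever your service logs land

# Signatures worth counting during triage: exceptions, timeouts, OOM kills.
PATTERNS = {
    "exception": re.compile(r"Exception|Traceback|ERROR", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out|connection reset", re.IGNORECASE),
    "oom": re.compile(r"out of memory|OOMKilled", re.IGNORECASE),
}

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1

# A lopsided count (thousands of timeouts, few exceptions) is the signal you want.
for name, count in counts.most_common():
    print(f"{name}: {count}")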

Set a time limit on this: five minutes. If you don't have a hypothesis by then, bring in another person.

Step 7: Send your first external communication (if P1/P2)

If customers are affected, they should hear from you before they start tweeting about it. Most teams aim to post an initial status page update within 15 minutes of a customer-impacting incident starting.

Your first update can be brief:

We're investigating reports of elevated error rates affecting checkout. Engineers are actively working on this. We'll post an update in 30 minutes.

You don't need the cause. You don't need the fix timeline. You just need customers to know you know.
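If your status page has an API, this first update can be scripted too. A sketch against a hypothetical status page endpoint; check your provider's docs for the real URL and fields:

import os
import requests

# Hypothetical status page API; field names vary by provider.
STATUS_API = "https://status.example.com/api/v1/incidents"
TOKEN = os.environ["STATUSPAGE_TOKEN"]

def post_initial_update(title: str, body: str) -> None:
    resp = requests.post(
        STATUS_API,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"title": title, "status": "investigating", "body": body},
        timeout=10,
    )
    resp.raise_for_status()

post_initial_update(
    title="Elevated error rates on checkout",
    body=(
        "We're investigating reports of elevated error rates affecting checkout. "
        "Engineers are actively working on this. We'll post an update in 30 minutes."
    ),
)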

Step 8: Start your working notes

Somewhere, start capturing a running timeline. This can be in the incident channel, a shared doc, or your incident management tool. Include:

  • When the alert fired
  • When you acknowledged
  • Each hypothesis you tested and what you found
  • Each action you took and when
  • Who joined the incident and when

You won't remember all of this in the postmortem two days from now. Take notes as you go, even rough ones.
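Even a tiny append-only note-taker beats reconstructing the timeline from memory. A sketch that stamps each entry with UTC time; swap the file for your incident tool if it captures a timeline for you:

from datetime import datetime, timezone

NOTES_PATH = "incident-notes.md"  # or a shared doc / incident tool, per your team's habit

def note(entry: str) -> None:
    # Append a UTC-timestamped line so the postmortem timeline writes itself.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    with open(NOTES_PATH, "a") as notes:
        notes.write(f"{stamp}  {entry}\n")

note("Alert fired: elevated 500s on /api/checkout")
note("Acknowledged, declared P2, opened #inc-2026-03-27-api-errors")
note("Hypothesis: connection timeouts to payment processor; checking recent deploys")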

What good triage looks like in practice

Here's a realistic version of the first 10 minutes done well:

  • 00:00 Alert fires. You acknowledge immediately.
  • 00:30 You check dashboards. Error rate is up 12x on the payments service. Latency is high. Homepage is fine.
  • 01:00 You declare P2 and open #inc-api-payments.
  • 01:30 You post your initial message in the channel with what you know.
  • 02:00 You check deploys. Two services shipped in the last hour.
  • 03:00 You page the payments service owner and the engineer who did the most recent deploy.
  • 04:00 You scan logs. You see repeated connection timeout errors to the payment processor.
  • 05:00 You post an update: "Seeing connection timeouts to payment processor. Investigating whether this is on our side or theirs."
  • 07:00 The payments service owner joins. You hand them the context you've gathered.
  • 08:00 You update the status page with an initial customer communication.

By minute 10, you have: a severity, a focused investigation, the right people involved, customers informed, and a timeline started. You haven't fixed anything yet, but you've set up the rest of the incident to go well.

When things go wrong in triage

A few patterns that derail the first 10 minutes:

Going too deep too fast. You spend 8 minutes reading logs before you've even checked whether customers are affected. Pull back, orient, then focus.

Solo heroics. You don't want to wake anyone up, so you spend 20 minutes debugging alone what would have taken 5 minutes with the right person. Err on the side of looping people in earlier.

Skipping communication. You're heads-down fixing and forget to post updates, so your team and your customers are in the dark. Set a timer and post a status update every 20-30 minutes, even if there's nothing new to say.

Treating every alert like a P1. Alert fatigue is real. If your P1s are firing multiple times a week on things that turn out to be noise, your team will start slow-rolling even real incidents. Tune your alerts and keep severity meaningful.


The first 10 minutes aren't about being the fastest or smartest engineer in the room. They're about being methodical when things are chaotic. Acknowledge, assess, declare, communicate, and get the right people involved. Everything else follows from that.
