
How to Write an Incident Response Plan (With Free Template)
An incident response plan is one of those things every team knows they should have but few actually write down. When production is on fire, nobody is going to read a 30-page document. But a short, focused plan that answers the right questions? That's the difference between a 20-minute fix and a 3-hour scramble.
If you don't have a plan yet, or if your current one lives in a dusty Confluence page that nobody's read since it was written, start over. Here's how to write one that works.
What Goes in an Incident Response Plan
A good plan answers five questions:
- How do we know something is wrong? (Detection)
- How bad is it? (Severity classification)
- Who does what? (Roles)
- How do we communicate? (Internal and external)
- What happens after we fix it? (Post-incident review)
That's it. Everything else is details. If your plan covers these five things clearly, your team can handle most incidents without improvising.
Severity Levels
Severity classification is the first decision during any incident. Get it wrong, and you either under-react to something serious or wake up the entire company for a non-issue.
Most teams use three to five severity levels; four is the sweet spot for most organizations:
| Level | Name | Criteria | Response Time |
|---|---|---|---|
| P1 | Critical | Full outage, data breach, revenue impact | 15 minutes |
| P2 | High | Major feature degraded, significant user impact | 30 minutes |
| P3 | Medium | Minor impact, workaround exists | 4 hours |
| P4 | Low | Cosmetic issue, no user impact | 24 hours |
The specific criteria matter more than the labels. Anyone on your team should be able to look at an incident and quickly decide which bucket it falls into. Vague criteria like "significant impact" aren't useful unless your team agrees on what "significant" means.
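The matrix above can double as configuration for your tooling. Here's a minimal sketch in Python; the response times match the table, but the notification channel names are placeholders to adapt to your own stack:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: str
    name: str
    response_minutes: int
    notify: tuple[str, ...]  # channel names are illustrative

# The four-level matrix from the table above.
SEVERITIES = {
    "P1": Severity("P1", "Critical", 15, ("pager", "slack", "leadership")),
    "P2": Severity("P2", "High", 30, ("pager", "slack")),
    "P3": Severity("P3", "Medium", 240, ("slack",)),
    "P4": Severity("P4", "Low", 1440, ("ticket",)),
}

def response_deadline_minutes(level: str) -> int:
    """Look up how quickly someone must respond for a given severity."""
    return SEVERITIES[level].response_minutes
```

Keeping the matrix in one data structure means your alerting, paging, and status tooling all agree on what "P1" means.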
If you haven't defined severity levels yet, our severity matrix builder walks you through creating one with response times and notification channels for each level.
Roles During an Incident
When an incident starts, three roles matter most:
Incident Commander. This person owns the incident. They decide severity, coordinate the response, and make calls about escalation. They don't need to be the most senior person. They need to be organized, calm, and willing to make decisions without perfect information.
Technical Lead. The person actually diagnosing and fixing the problem. On a small team, this is whoever is on-call. On a larger team, it might be the engineer with the most context on the affected system.
Communications Lead. Someone who handles updates to stakeholders, customers, and status pages. During a P1 incident, the technical team shouldn't be writing customer emails. That's a distraction.
For small teams, one person might cover multiple roles. That's fine. The point isn't to fill every role; it's to make sure someone is explicitly responsible for each function.
Communication During Incidents
The biggest time-waster during incidents is communication chaos. People asking for updates in different Slack channels. Stakeholders pinging engineers directly. Duplicate threads about the same problem.
Set up a standard approach before incidents happen:
Internal: Use a dedicated incident channel (or create one per incident). Post updates every 15 minutes for P1, every 30 minutes for P2. Even if the update is "still investigating," post it. Silence makes people nervous and generates more interruptions.
Customer-facing: Have pre-written templates for status page updates. You shouldn't be wordsmithing during an active incident. Something like:
We are investigating reports of [brief description]. Some users may experience [specific impact]. We will provide updates as we learn more.
Stakeholders: Notify leadership for P1 and P2 incidents. Keep it brief. They want to know: what's happening, how bad is it, and when's the next update.
Escalation
Your plan needs clear escalation paths. When the on-call engineer can't fix something alone, who do they call? And what happens if that person doesn't respond?
A simple escalation chain:
- On-call engineer acknowledges the alert (5 min timeout)
- If no acknowledgment, escalate to backup (5 min timeout)
- If still unacknowledged, escalate to team lead (10 min timeout)
- For P1s, notify engineering leadership immediately
The timeouts are important. Without them, alerts sit in limbo while everyone assumes someone else is handling it.
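The chain above is just ordered data: who to page, and how long to wait for an acknowledgment before moving on. A sketch of that logic, with the role names and timeouts illustrative:

```python
# Escalation steps as (target, minutes to wait for an ack) pairs,
# mirroring the chain described above.
ESCALATION_CHAIN = [
    ("on-call engineer", 5),
    ("backup on-call", 5),
    ("team lead", 10),
]

def current_escalation_target(minutes_unacknowledged: int) -> str:
    """Return who should currently be paged, given minutes without an ack."""
    elapsed = 0
    for target, timeout in ESCALATION_CHAIN:
        elapsed += timeout
        if minutes_unacknowledged < elapsed:
            return target
    # Chain exhausted; for P1s, leadership is notified immediately anyway.
    return "engineering leadership"
```

Encoding the timeouts explicitly is what prevents the "alert in limbo" failure mode: the next hop fires whether or not anyone is watching.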
Post-Incident Review
Every P1 and P2 should get a review. Not a blame session, and not a 20-page document. A focused review that answers:
- What happened? (Timeline)
- Why did it happen? (Contributing factors)
- How did we detect it? (Was alerting effective?)
- How did we fix it? (Resolution steps)
- What will we change? (Specific action items with owners)
Keep it short. Two pages max. If your postmortems are longer than that, you're over-documenting.
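One way to keep reviews consistently short is to generate the skeleton from the five questions above, so every postmortem starts from the same two-page structure. A minimal sketch; the markdown layout and field names are illustrative:

```python
# The five review questions from above, each with a hint for the author.
POSTMORTEM_SECTIONS = [
    ("What happened?", "Timeline"),
    ("Why did it happen?", "Contributing factors"),
    ("How did we detect it?", "Was alerting effective?"),
    ("How did we fix it?", "Resolution steps"),
    ("What will we change?", "Specific action items with owners"),
]

def postmortem_skeleton(incident_id: str, severity: str) -> str:
    """Render a blank markdown postmortem for a given incident."""
    lines = [f"# Postmortem: {incident_id} ({severity})", ""]
    for question, hint in POSTMORTEM_SECTIONS:
        lines += [f"## {question}", f"_{hint}_", ""]
    return "\n".join(lines)
```

Dropping the generated file into the incident channel as soon as the fire is out makes it much more likely the review actually happens.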
Generate Your Plan
If you want a starting point, our incident response plan generator creates a customized plan based on your team size, industry, and preferences. It generates a complete markdown document with severity levels, roles, communication protocols, and a post-incident checklist that you can adapt to your team.
You can also build supporting artifacts with our other free tools:
- Severity matrix builder for detailed severity criteria
- Runbook template generator for service-specific response procedures
- On-call schedule generator for setting up rotations
Keep It Updated
An incident response plan isn't a one-time document. Review it quarterly. After every major incident, ask: did the plan help or did we improvise around it? If you improvised, update the plan.
Also make sure new team members read it. During onboarding, walk them through the severity levels, the communication channels, and who to escalate to. The plan is only useful if people know it exists.
The best incident response plans are short, specific, and practiced. Write yours down, test it with a drill, and update it when reality doesn't match the document.
If you want to automate the alerting and escalation parts of your plan, NearIRM handles multi-channel notifications, on-call routing, and escalation policies so your team can focus on actually fixing the problem.