
The Incident Commander Role: Why Every On-Call Team Needs One
When a major incident hits, most engineering teams do the same thing: everyone piles into a Slack channel, five people start poking the same service, someone rolls back a change without telling anyone, and the on-call engineer gets buried under questions while also trying to diagnose the problem.
It's chaos. And it's completely avoidable.
The incident commander (IC) role is borrowed from emergency services, where it's been standard for decades. Applied to software incidents, it's one of the highest-leverage changes a team can make to how they respond to outages.
What an incident commander actually does
The IC isn't the person who fixes the incident. That distinction matters.
The IC's job is to coordinate the response, not to be the smartest engineer in the room at that moment. Specifically, they:
- Own the incident timeline. The IC tracks what's been tried, what's been ruled out, and what's happening next. Without this, teams repeat work and lose track of their own investigation.
- Assign clear roles. They designate who's investigating, who's communicating externally, and who's watching for second-order effects. One person owns each task.
- Control the pace. They call regular sync-ups ("Let's take a 60-second readout every 10 minutes"), push for decisions when the team gets stuck, and decide when to escalate and bring in more people.
- Protect the investigators. When a VP asks "what's happening?" in the incident channel, the IC answers, so the engineers diagnosing the problem can stay focused.
- Document as they go. Not a full postmortem, just enough that the team can reconstruct the timeline later.
The IC is explicitly not the person typing commands or pulling up dashboards. They're running the response, not the investigation.
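One way to make the timeline responsibility concrete is a small append-only log that the IC or scribe writes to as the incident unfolds. The sketch below is illustrative only: the `IncidentTimeline` class, the entry kinds, and the field names are assumptions made up for this example, not part of any standard tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical entry kinds, mirroring "what's been tried, what's been
# ruled out, and what's happening next".
KINDS = ("tried", "ruled_out", "next", "decision")


@dataclass
class TimelineEntry:
    at: datetime
    kind: str
    note: str
    owner: str = ""


@dataclass
class IncidentTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def log(self, kind: str, note: str, owner: str = "",
            at: Optional[datetime] = None) -> TimelineEntry:
        """Append one entry; the timestamp defaults to the current UTC time."""
        if kind not in KINDS:
            raise ValueError(f"unknown kind {kind!r}, expected one of {KINDS}")
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, note, owner)
        self.entries.append(entry)
        return entry

    def of_kind(self, kind: str) -> List[str]:
        """Return the notes of one kind, e.g. everything ruled out so far."""
        return [e.note for e in self.entries if e.kind == kind]

    def render(self) -> str:
        """Render the timeline in chronological order, one line per entry."""
        lines = []
        for e in sorted(self.entries, key=lambda e: e.at):
            suffix = f" ({e.owner})" if e.owner else ""
            lines.append(f"{e.at:%H:%M} [{e.kind}] {e.note}{suffix}")
        return "\n".join(lines)
```

Calling `timeline.of_kind("ruled_out")` before assigning a new line of investigation is a cheap way to avoid repeating work, and `timeline.render()` gives a starting point for the postmortem timeline.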
When you need one
For minor incidents, one engineer can handle everything. Alert fires, they investigate, they fix it, they write a brief note. The overhead of formal roles isn't worth it.
Once an incident crosses a certain threshold, that changes. A rough guide:
| Situation | IC needed? |
|---|---|
| Single engineer can diagnose within 10-15 minutes | No |
| More than 2 engineers involved | Yes |
| Customer-facing impact lasting more than 30 minutes | Yes |
| Multiple services or teams affected | Yes |
| Engineering leadership needs to be looped in | Yes |
| You've been stuck for 20+ minutes without progress | Yes |
The earlier you assign an IC during a long incident, the better. Waiting until things are already chaotic means the IC spends their first 20 minutes just figuring out what everyone's been doing.
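The rough guide in the table can be expressed as a simple check, for example inside a chat bot or a paging rule. This is a minimal sketch: the `IncidentState` fields and the function names are hypothetical, and the thresholds are copied from the table, so adjust them to fit your own team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Snapshot of an incident. All field names are hypothetical."""
    engineers_involved: int = 1
    customer_impact_minutes: int = 0
    services_affected: int = 1
    teams_affected: int = 1
    leadership_needs_update: bool = False
    minutes_without_progress: int = 0


def ic_reasons(state: IncidentState) -> List[str]:
    """Return every row of the rough guide that applies to this incident."""
    reasons = []
    if state.engineers_involved > 2:
        reasons.append("more than 2 engineers involved")
    if state.customer_impact_minutes > 30:
        reasons.append("customer-facing impact lasting more than 30 minutes")
    if state.services_affected > 1 or state.teams_affected > 1:
        reasons.append("multiple services or teams affected")
    if state.leadership_needs_update:
        reasons.append("engineering leadership needs to be looped in")
    if state.minutes_without_progress >= 20:
        reasons.append("stuck for 20+ minutes without progress")
    return reasons


def needs_ic(state: IncidentState) -> bool:
    """An IC is needed as soon as any single row of the guide applies."""
    return bool(ic_reasons(state))
```

Returning the list of reasons rather than a bare boolean means the prompt to assign an IC can say why it fired, which makes it easier for the team to accept.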
Common mistakes teams make
The IC also tries to fix things. This is the most common failure mode. The moment the IC opens a terminal or starts pulling metrics, they've abandoned their coordination role. Usually nobody steps in to fill the vacuum, and you're back to chaos. If you're the best person to diagnose a specific problem, hand off the IC role to someone else first.
No handoffs during long incidents. If an incident runs for four hours, the IC from hour one shouldn't still be running things at hour four. Rotate the role. Brief your replacement: here's what we know, here's what we're currently trying, here are the open questions.
The IC tries to answer everything. Other engineers and stakeholders will direct questions at the IC. The IC needs to be able to say "I don't know, ask [name]" or "that's out of scope for right now." Not every question deserves an answer mid-incident.
Assigning IC to the most senior person by default. Seniority doesn't automatically make someone a good incident commander. The IC role benefits from people who stay calm under pressure, communicate clearly, and can make decisions with incomplete information. Those skills don't always correlate with technical seniority.
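The handoff brief described above (what we know, what we're currently trying, what's still open) is easier to deliver consistently if it has a fixed shape. The sketch below is one possible template; the `HandoffBrief` class and its field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffBrief:
    """Hypothetical template for briefing a replacement IC."""
    outgoing_ic: str
    incoming_ic: str
    known: List[str] = field(default_factory=list)
    in_progress: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the brief as plain text suitable for the incident channel."""
        sections = [
            ("What we know", self.known),
            ("What we're currently trying", self.in_progress),
            ("Open questions", self.open_questions),
        ]
        lines = [f"IC handoff: {self.outgoing_ic} -> {self.incoming_ic}"]
        for title, items in sections:
            lines.append(f"{title}:")
            # An empty section is shown explicitly so nothing is silently skipped.
            lines.extend(f"- {item}" for item in (items or ["(none recorded)"]))
        return "\n".join(lines)
```

Posting the rendered brief in the incident channel also tells everyone else who is now running the response.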
A basic IC checklist
When you take on the IC role, run through this in the first five minutes:
- Confirm scope. What services are affected? What is the user-facing impact?
- Assign a scribe. Someone else should be taking notes. Not you.
- Identify who's investigating. Name them explicitly. "Alex, you own the database side. Jordan, you're on the application layer."
- Set a communication cadence. Decide how often you'll post updates to the incident channel and the status page.
- Identify any immediate mitigations. Can you reduce impact while the investigation continues? (Feature flag, traffic rerouting, rollback?)
- Note your start time and severity. This matters for SLA tracking and the postmortem.
During an active incident, clarity beats thoroughness. A brief, accurate update posted every 10 minutes is more useful than a detailed update posted once an hour.
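The checklist and the update cadence are both easy to track mechanically, so the IC doesn't have to hold them in their head. The following is a minimal sketch, assuming the six checklist items above and a 10-minute default cadence; the helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Set

# The six first-five-minutes items from the checklist above.
CHECKLIST = [
    "Confirm scope",
    "Assign a scribe",
    "Identify who's investigating",
    "Set a communication cadence",
    "Identify any immediate mitigations",
    "Note your start time and severity",
]


def remaining(done: Set[str]) -> List[str]:
    """Return checklist items not yet completed, in checklist order."""
    return [item for item in CHECKLIST if item not in done]


def next_update_due(last_update: datetime, cadence_minutes: int = 10) -> datetime:
    """Time by which the next status update should be posted."""
    return last_update + timedelta(minutes=cadence_minutes)


def is_update_overdue(last_update: datetime, now: datetime,
                      cadence_minutes: int = 10) -> bool:
    """True once the cadence interval has fully elapsed since the last update."""
    return now >= next_update_due(last_update, cadence_minutes)
```

A reminder driven by `is_update_overdue` keeps updates short and regular, which matches the point about clarity over thoroughness.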
How to build this into your team
Most teams don't need formal IC certification. What they need is a shared understanding of what the role is and enough practice that people can step into it under pressure.
Write down your IC responsibilities. Add a short "incident commander" section to your runbooks. What does the IC do? What do they not do? What's the handoff process for long incidents? If it isn't written down, every person will have a different mental model.
Designate an IC during drills and game days. Incident simulations are the best place to practice this, because the stakes are low and you can debrief afterward. Run a game day and explicitly assign one person as IC. Then debrief on how the coordination felt.
Rotate the IC role in on-call rotations. Don't always assign IC duties to the same person. Include IC shadowing as part of how you onboard new on-call engineers.
Debrief IC performance in postmortems. Postmortems usually focus on the technical cause and the fix. Add a section on the response itself: How did coordination go? Where did people duplicate work? Was the IC able to stay in their lane? This is how teams get better at the human side of incidents.
The IC isn't overhead
Some teams resist formalizing the IC role because it feels like process for its own sake. The argument: "We're a small team, we don't need hierarchy during incidents."
But coordination during incidents isn't hierarchy. It's division of labor. When five engineers are all looking at the same thing, four of them are wasted. When nobody owns the incident timeline, the timeline gets lost. When every stakeholder question interrupts the person debugging, the debugging takes longer.
The IC role doesn't slow incidents down. It removes the overhead that slows incidents down.
If your last major incident felt scattered, chances are you didn't have a clear IC. Try it on the next one.