
How to Write Runbooks That Actually Get Used
A runbook only has value at 2 AM when someone is stressed, half-asleep, and your system is on fire. That's the context to write for. Not for the calm, well-rested engineer reviewing documentation on a Tuesday afternoon.
Most runbooks fail that test. They're either too vague ("check the logs"), too long to scan, or three months out of date. When that happens, engineers skip them entirely and figure things out on their own, which is exactly what you were trying to avoid.
Here's how to write runbooks that people actually open and follow.
Start with the trigger, not the system
The most common mistake is writing runbooks organized around a service or component ("Redis Runbook", "Payment Service Runbook"). That structure makes sense to the person who wrote it. It doesn't help the person receiving an alert.
Organize runbooks around the alert or symptom instead:
- `high-error-rate-on-checkout`, not `checkout-service`
- `redis-memory-at-90pct`, not `redis-infrastructure`
- `api-latency-above-2s`, not `api-gateway`
The engineer responding to an alert should be able to search for exactly what the alert says and land on the right runbook. One alert, one runbook. When a runbook covers too many alerts or failure modes, it becomes a wiki page, not a runbook.
The four things every runbook needs
Strip out everything else, but don't leave out these four:
1. What's broken and why it matters
Two or three sentences max. What is this alert saying, and what's the user-visible impact?
The checkout error rate has exceeded 5% for more than 2 minutes.
Users are seeing failures when trying to complete purchases.
Revenue impact begins immediately.
Skip the history of the service. Skip architectural diagrams. Just tell me what's happening.
2. How to confirm it
The alert fired, but maybe it's a false positive. Give the responder a command or dashboard link to verify:
```shell
# Check error rate in the last 5 minutes
kubectl logs -n production -l app=checkout --since=5m | grep "ERROR" | wc -l

# Or check the Grafana dashboard:
# https://grafana.internal/d/checkout-overview
```
Concrete is better than abstract. An actual command beats "check the logs."
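If the alert threshold is a percentage, a raw error count isn't enough on its own. A small helper like this (illustrative only, not part of any real tooling) turns the counts from a pipeline like the one above into a rate the responder can compare against the 5% threshold:

```shell
# Convert error/total counts into a whole-number percentage.
# In practice the counts would come from the kubectl pipelines above;
# they're plain arguments here so the helper is easy to sanity-check.
error_rate() {
  local errors=$1 total=$2
  if [ "$total" -eq 0 ]; then
    echo 0          # no traffic in the window, nothing to alert on
    return
  fi
  echo $(( errors * 100 / total ))
}

error_rate 12 150   # prints 8, i.e. an 8% error rate
```

Integer division is plenty here: the responder needs to know "above or below 5%", not three decimal places.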
3. Ordered remediation steps
Number them. Keep each step to one action. Don't combine "check X and if Y then do Z" into a single step.
1. Check whether the upstream payment processor is returning errors:

   ```shell
   curl -s https://status.paymentprocessor.com/api/v2/status.json | jq '.status.indicator'
   ```

2. If the processor status is not `"none"`, this is an upstream issue. Acknowledge the alert and post in #incidents. No further action needed until they recover.

3. If the processor status is `"none"`, check the checkout pod logs:

   ```shell
   kubectl logs -n production -l app=checkout --since=10m | grep "ERROR" | tail -50
   ```

4. Look for database connection errors. If present, follow the db-connection-pool runbook.

5. If no obvious cause, restart the checkout deployment:

   ```shell
   kubectl rollout restart deployment/checkout -n production
   ```

6. Watch the error rate drop in Grafana. If it doesn't improve within 3 minutes, escalate to the platform team.
That example is specific enough to follow without thinking too hard. That's the goal.
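The branching in those six steps can be collapsed into a sketch like the one below. It's purely illustrative: the function name is made up, and a real responder follows the numbered list, but writing the logic out this way is a quick check that the steps don't contradict each other:

```shell
# Hypothetical sketch of the triage order above: given the processor
# status indicator and whether the pod logs show DB connection errors,
# print the next action. Not a real tool -- just the decision flow.
next_action() {
  local indicator=$1 db_errors=$2
  if [ "$indicator" != "none" ]; then
    echo "upstream issue: ack the alert, post in #incidents"
  elif [ "$db_errors" = "yes" ]; then
    echo "follow the db-connection-pool runbook"
  else
    echo "restart checkout, watch Grafana, escalate after 3 minutes"
  fi
}

next_action minor no   # upstream outage path
next_action none yes   # database path
next_action none no    # restart path
```

If a sketch like this comes out with more than two or three levels of nesting, that's a sign the runbook should be split.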
4. Escalation path
Who do you call if the steps don't work? Be explicit. "Escalate to the platform team" isn't enough. Include a name or a role with a way to contact them.
If unresolved after 15 minutes:
- Platform on-call: page via NearIRM escalation policy "platform-critical"
- Payment team lead: @jane in Slack, or +1-555-0100 for P1 incidents
Keep the format scannable
Responders are not reading linearly. They're skimming for the step that matches where they are in the investigation. Format for scanning:
- Use numbered steps for ordered actions
- Use bullet points for parallel checks or options
- Use code blocks for every command, not just inline backticks
- Use bold sparingly, only for genuinely critical callouts
Avoid paragraphs of prose in runbooks. If you catch yourself writing three sentences in a row, break it into a list or a code block.
Tables work well when you need to describe multiple scenarios:
| Symptom | Likely cause | First action |
|---|---|---|
| 5xx errors, processor down | Third-party outage | Post in #incidents, no code action |
| 5xx errors, processor up | Pod or DB issue | Check pod logs |
| Latency spike, no errors | Slow query or cache miss | Check DB slow query log |
A table like this lets the responder jump straight to the row that matches what they're seeing.
The freshness problem
Runbooks go stale. The service gets refactored, the kubectl command changes, the escalation contact leaves the company. A stale runbook is worse than no runbook because it wastes time and erodes trust.
A few things that help:
Tie runbook reviews to postmortems. After any incident, check whether the runbook was accurate. If someone had to deviate from the steps, update it before closing the postmortem. The incident is fresh context; use it.
Put a "last validated" date at the top. Not a "last updated" date, since that just tells you when someone edited the file. A "last validated" date means someone confirmed the steps still work. Even a quarterly review cycle is better than nothing.
Make it easy to flag stale content. Add a feedback link or a "something wrong here?" note at the bottom. Engineers who follow a runbook and find it broken should be able to file a fix in under a minute.
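A check like this can back up the "last validated" date with automation, assuming each runbook carries a line such as `Last validated: 2025-01-15`. The 90-day cutoff, the ISO date format, and the `runbooks/` directory are all assumptions to adapt; `date -d` is GNU date:

```shell
# Succeed if an ISO "last validated" date is within the last 90 days,
# fail otherwise. ISO dates sort correctly as strings, so a plain
# string comparison against the cutoff works.
recently_validated() {
  local validated=$1 cutoff
  cutoff=$(date -d '-90 days' +%Y-%m-%d)
  [ -n "$validated" ] && [ ! "$validated" \< "$cutoff" ]
}

# Example: flag every runbook in a directory that has gone stale
for f in runbooks/*.md; do
  [ -e "$f" ] || continue
  d=$(grep -m1 -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$f")
  recently_validated "$d" || echo "stale: $f (last validated: ${d:-never})"
done
```

Run from CI on a schedule, the output becomes a nag list instead of relying on anyone remembering to review.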
What to leave out
Runbooks accumulate cruft. Things to cut:
Background history. Nobody needs to know the service was originally built in Rails before being rewritten in Go. Skip it.
Links to general documentation. "For more context, see the architecture overview" is fine in a design doc. In a runbook, it's a distraction. If the architecture context is necessary to follow a step, include just the relevant part inline.
Conditional trees that span multiple levels. "If A, check B. If B is X, do C, but if B is Y, check D first, unless E is also happening..." Flatten it. Split into multiple runbooks if needed.
Steps that require knowledge not in the runbook. If step 4 says "fix the database query," that's not a step, that's a task. Either spell out how to fix it or replace the step with "escalate to the database team."
Make runbooks part of the alert
The best runbook is one the engineer doesn't have to search for. Link directly from the alert to the runbook.
Most alerting tools let you attach a URL to an alert payload. Use it. The responder gets paged, opens the alert, clicks the runbook link, and starts working. No wiki searching, no guessing which page applies.
In NearIRM, you can include a runbook URL in your alert routing rules so it appears directly in the notification. Set that up when you create the alert, not after the first incident where someone couldn't find it.
Test them before you need them
Write a runbook, then hand it to someone who didn't write it and ask them to follow it on a non-production system. Watch where they get confused or have to ask you a question. Every question is a gap in the runbook.
Game days are a good forcing function here. Running a simulated incident against real runbooks reveals the gaps that normal doc reviews miss. The runbook that looked fine on paper often has a broken command, a missing permission assumption, or a step that skips over something the author considered obvious.
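One cheap automated complement to a game day: before it starts, confirm that every tool the runbook's commands rely on is actually installed on the responder's machine. A throwaway helper, with the tool list being whatever your runbook actually invokes:

```shell
# Report any command-line tools from the list that aren't installed.
# Catches the runbook that assumes jq or a cloud CLI nobody has.
check_tools() {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing tool: $cmd"
      missing=1
    fi
  done
  return $missing
}

# sh/grep/awk here for illustration; in practice list kubectl, jq, etc.
check_tools sh grep awk   # exits nonzero if anything is missing
```

The "missing permission assumption" class of gap still needs a human walkthrough, but missing binaries are free to catch.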
A good runbook isn't a wiki page. It's closer to the checklist a pilot runs before takeoff: brief, ordered, and written so that even under pressure you don't have to think about what to do next. That's the standard worth holding them to.