
How to Write Runbooks That Actually Get Used
A runbook only has value at 2 AM when someone is stressed, half-asleep, and your system is on fire. That's the context to write for. Not for the calm, well-rested engineer reviewing documentation on a Tuesday afternoon.
Most runbooks fail that test. They're either too vague ("check the logs"), too long to scan, or three months out of date. When that happens, engineers skip them entirely and figure things out on their own, which is exactly what you were trying to avoid.
Here's how to write runbooks that people actually open and follow.
Start with the trigger, not the system
The most common mistake is writing runbooks organized around a service or component ("Redis Runbook", "Payment Service Runbook"). That structure makes sense to the person who wrote it. It doesn't help the person receiving an alert.
Organize runbooks around the alert or symptom instead:
- `high-error-rate-on-checkout`, not `checkout-service`
- `redis-memory-at-90pct`, not `redis-infrastructure`
- `api-latency-above-2s`, not `api-gateway`
The engineer responding to an alert should be able to search for exactly what the alert says and land on the right runbook. One alert, one runbook. When a runbook covers too many alerts or failure modes, it becomes a wiki page, not a runbook.
The four things every runbook needs
Strip out everything else, but don't leave out these four:
1. What's broken and why it matters
Two or three sentences max. What is this alert saying, and what's the user-visible impact?
The checkout error rate has exceeded 5% for more than 2 minutes.
Users are seeing failures when trying to complete purchases.
Revenue impact begins immediately.
Skip the history of the service. Skip architectural diagrams. Just tell me what's happening.
2. How to confirm it
The alert fired, but maybe it's a false positive. Give the responder a command or dashboard link to verify:
```shell
# Check error rate in the last 5 minutes
kubectl logs -n production -l app=checkout --since=5m | grep "ERROR" | wc -l

# Or check the Grafana dashboard:
# https://grafana.internal/d/checkout-overview
```
Concrete is better than abstract. An actual command beats "check the logs."
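If the alert threshold is a percentage, a raw error count isn't enough on its own. A small helper like this (illustrative only, not part of any real tooling) turns the counts from a pipeline like the one above into a rate the responder can compare against the 5% threshold:

```shell
# Convert error/total counts into a whole-number percentage.
# In practice the counts would come from the kubectl pipelines above;
# they're plain arguments here so the helper is easy to sanity-check.
error_rate() {
  local errors=$1 total=$2
  if [ "$total" -eq 0 ]; then
    echo 0          # no traffic in the window, nothing to alert on
    return
  fi
  echo $(( errors * 100 / total ))
}

error_rate 12 150   # prints 8, i.e. an 8% error rate
```

Integer division is plenty here: the responder needs to know "above or below 5%", not three decimal places.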
3. Ordered remediation steps
Number them. Keep each step to one action. Don't combine "check X and if Y then do Z" into a single step.
1. Check whether the upstream payment processor is returning errors:

   ```shell
   curl -s https://status.paymentprocessor.com/api/v2/status.json | jq '.status.indicator'
   ```

2. If the processor status is not `"none"`, this is an upstream issue. Acknowledge the alert and post in #incidents. No further action needed until they recover.

3. If the processor status is `"none"`, check the checkout pod logs:

   ```shell
   kubectl logs -n production -l app=checkout --since=10m | grep "ERROR" | tail -50
   ```

4. Look for database connection errors. If present, follow the db-connection-pool runbook.

5. If no obvious cause, restart the checkout deployment:

   ```shell
   kubectl rollout restart deployment/checkout -n production
   ```

6. Watch the error rate drop in Grafana. If it doesn't improve within 3 minutes, escalate to the platform team.
That example is specific enough to follow without thinking too hard. That's the goal.
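The branching in those six steps can be collapsed into a sketch like the one below. It's purely illustrative: the function name is made up, and a real responder follows the numbered list, but writing the logic out this way is a quick check that the steps don't contradict each other:

```shell
# Hypothetical sketch of the triage order above: given the processor
# status indicator and whether the pod logs show DB connection errors,
# print the next action. Not a real tool -- just the decision flow.
next_action() {
  local indicator=$1 db_errors=$2
  if [ "$indicator" != "none" ]; then
    echo "upstream issue: ack the alert, post in #incidents"
  elif [ "$db_errors" = "yes" ]; then
    echo "follow the db-connection-pool runbook"
  else
    echo "restart checkout, watch Grafana, escalate after 3 minutes"
  fi
}

next_action minor no   # upstream outage path
next_action none yes   # database path
next_action none no    # restart path
```

If a sketch like this comes out with more than two or three levels of nesting, that's a sign the runbook should be split.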
4. Escalation path
Who do you call if the steps don't work? Be explicit. "Escalate to the platform team" isn't enough. Include a name or a role with a way to contact them.
If unresolved after 15 minutes:
- Platform on-call: page via NearIRM escalation policy "platform-critical"
- Payment team lead: @jane in Slack, or +1-555-0100 for P1 incidents
Keep the format scannable
Responders are not reading linearly. They're skimming for the step that matches where they are in the investigation. Format for scanning:
- Use numbered steps for ordered actions
- Use bullet points for parallel checks or options
- Use code blocks for every command, not just inline backticks
- Use bold sparingly, only for genuinely critical callouts
Avoid paragraphs of prose in runbooks. If you catch yourself writing three sentences in a row, break it into a list or a code block.
Tables work well when you need to describe multiple scenarios:
| Symptom | Likely cause | First action |
|---|---|---|
| 5xx errors, processor down | Third-party outage | Post in #incidents, no code action |
| 5xx errors, processor up | Pod or DB issue | Check pod logs |
| Latency spike, no errors | Slow query or cache miss | Check DB slow query log |
A table like this lets the responder jump straight to the row that matches what they're seeing.
The freshness problem
Runbooks go stale. The service gets refactored, the kubectl command changes, the escalation contact leaves the company. A stale runbook is worse than no runbook because it wastes time and erodes trust.
A few things that help:
Tie runbook reviews to postmortems. After any incident, check whether the runbook was accurate. If someone had to deviate from the steps, update it before closing the postmortem. The incident is fresh context; use it.
Put a "last validated" date at the top. Not a "last updated" date, since that just tells you when someone edited the file. A "last validated" date means someone confirmed the steps still work. Even a quarterly review cycle is better than nothing.
Make it easy to flag stale content. Add a feedback link or a "something wrong here?" note at the bottom. Engineers who follow a runbook and find it broken should be able to file a fix in under a minute.
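A check like this can back up the "last validated" date with automation, assuming each runbook carries a line such as `Last validated: 2025-01-15`. The 90-day cutoff, the ISO date format, and the `runbooks/` directory are all assumptions to adapt; `date -d` is GNU date:

```shell
# Succeed if an ISO "last validated" date is within the last 90 days,
# fail otherwise. ISO dates sort correctly as strings, so a plain
# string comparison against the cutoff works.
recently_validated() {
  local validated=$1 cutoff
  cutoff=$(date -d '-90 days' +%Y-%m-%d)
  [ -n "$validated" ] && [ ! "$validated" \< "$cutoff" ]
}

# Example: flag every runbook in a directory that has gone stale
for f in runbooks/*.md; do
  [ -e "$f" ] || continue
  d=$(grep -m1 -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$f")
  recently_validated "$d" || echo "stale: $f (last validated: ${d:-never})"
done
```

Run from CI on a schedule, the output becomes a nag list instead of relying on anyone remembering to review.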
What to leave out
Runbooks accumulate cruft. Things to cut:
Background history. Nobody needs to know the service was originally built in Rails before being rewritten in Go. Skip it.
Links to general documentation. "For more context, see the architecture overview" is fine in a design doc. In a runbook, it's a distraction. If the architecture context is necessary to follow a step, include just the relevant part inline.
Conditional trees that span multiple levels. "If A, check B. If B is X, do C, but if B is Y, check D first, unless E is also happening..." Flatten it. Split into multiple runbooks if needed.
Steps that require knowledge not in the runbook. If step 4 says "fix the database query," that's not a step, that's a task. Either spell out how to fix it or replace the step with "escalate to the database team."
Make runbooks part of the alert
The best runbook is one the engineer doesn't have to search for. Link directly from the alert to the runbook.
Most alerting tools let you attach a URL to an alert payload. Use it. The responder gets paged, opens the alert, clicks the runbook link, and starts working. No wiki searching, no guessing which page applies.
In NearIRM, you can include a runbook URL in your alert routing rules so it appears directly in the notification. Set that up when you create the alert, not after the first incident where someone couldn't find it.
Test them before you need them
Write a runbook, then hand it to someone who didn't write it and ask them to follow it on a non-production system. Watch where they get confused or have to ask you a question. Every question is a gap in the runbook.
Game days are a good forcing function here. Running a simulated incident against real runbooks reveals the gaps that normal doc reviews miss. The runbook that looked fine on paper often has a broken command, a missing permission assumption, or a step that skips over something the author considered obvious.
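One cheap automated complement to a game day: before it starts, confirm that every tool the runbook's commands rely on is actually installed on the responder's machine. A throwaway helper, with the tool list being whatever your runbook actually invokes:

```shell
# Report any command-line tools from the list that aren't installed.
# Catches the runbook that assumes jq or a cloud CLI nobody has.
check_tools() {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing tool: $cmd"
      missing=1
    fi
  done
  return $missing
}

# sh/grep/awk here for illustration; in practice list kubectl, jq, etc.
check_tools sh grep awk   # exits nonzero if anything is missing
```

The "missing permission assumption" class of gap still needs a human walkthrough, but missing binaries are free to catch.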
A good runbook isn't a wiki page. It's closer to the checklist a pilot runs before takeoff: brief, ordered, and written so that even under pressure you don't have to think about what to do next. That's the standard worth holding them to.