Runbook Automation: Which Steps to Automate First

Runbooks are only as good as your ability to execute them under pressure. At 3am, with Slack pinging and a P1 open, even a well-written runbook introduces friction: you have to read it, copy-paste commands, confirm you're on the right host, and try not to make a typo.

Automating parts of that process doesn't mean removing the human from the loop. It means handling the parts that are repetitive and deterministic so the on-call engineer can focus on the parts that actually require judgment.

The question is: which steps should you automate first?

The four criteria for automation candidates

Not every runbook step is a good automation candidate. Before you write a script, run it through these four checks:

1. Is it deterministic? The action should produce the same result every time it's triggered, given the same alert context. "Restart the web workers when memory exceeds 90%" is deterministic. "Decide whether to roll back or roll forward" is not.

2. Is the blast radius low if it goes wrong? Automation that can cause more damage than the incident itself isn't worth the risk. Restarting a single service is low blast radius. Dropping a database table, even if it's in a runbook, shouldn't be automated.

3. Is it executed frequently? Automation takes time to build and maintain, so it pays off faster on a step that fires once a week than on one that fires once a year. Check your incident history: if a runbook hasn't been used in six months, it's probably not your highest-priority automation target.

4. Does your team agree on what "correct" looks like? If two engineers would make different decisions about whether to run the step, it's not ready to automate. That disagreement is a signal the step needs more definition first.
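The four checks can be written down as a small screening function. This is a minimal sketch, not a standard scoring model: the RunbookStep fields, the failed_checks helper, and the threshold of 12 runs per year are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunbookStep:
    name: str
    deterministic: bool      # same alert context always leads to the same action
    low_blast_radius: bool   # a wrong run cannot be worse than the incident
    runs_per_year: int       # taken from incident history
    team_agrees: bool        # engineers would make the same call


def failed_checks(step: RunbookStep, min_runs_per_year: int = 12) -> list[str]:
    """Return the checks a step fails; an empty list means it is a candidate."""
    failures = []
    if not step.deterministic:
        failures.append("deterministic")
    if not step.low_blast_radius:
        failures.append("blast radius")
    if step.runs_per_year < min_runs_per_year:  # assumed threshold
        failures.append("frequency")
    if not step.team_agrees:
        failures.append("team agreement")
    return failures


restart = RunbookStep("restart web workers", True, True, 40, True)
rollback = RunbookStep("roll back or roll forward", False, False, 6, False)

print(failed_checks(restart))   # []
print(failed_checks(rollback))  # ['deterministic', 'blast radius', 'frequency', 'team agreement']
```

A step that fails any check goes back to the runbook for more definition rather than into a script.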

Where to start: five steps worth automating early

These are the runbook actions most on-call teams can automate with little risk and a quick payoff:

1. Restarting a service or pod

This is the most common first automation. For many alerts, a service restart is the fastest path to recovery while a root cause investigation runs in parallel. It's low risk, it's fast, and it's almost always the first step someone does manually anyway.

For containerized workloads:

kubectl rollout restart deployment/my-service -n production

Wire this to your on-call platform so it either runs automatically when a specific alert fires or appears as a one-click button in the incident context. Either way, the on-call engineer doesn't have to SSH anywhere.
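One way to keep an automated restart narrow is to map each alert to the single deployment it may restart, and refuse everything else. The sketch below assumes kubectl is installed and authenticated wherever it runs; the alert name and the RESTARTABLE table are hypothetical, and dry_run defaults to True so nothing is executed by accident.

```python
from __future__ import annotations

import subprocess

# Hypothetical mapping: alert name -> (namespace, deployment) it may restart.
RESTARTABLE = {
    "web-workers-high-memory": ("production", "my-service"),
}


def restart_command(alert_name: str) -> list[str]:
    """Build the kubectl command for an alert, or raise if it is not allowlisted."""
    if alert_name not in RESTARTABLE:
        raise ValueError(f"no automated restart configured for {alert_name!r}")
    namespace, deployment = RESTARTABLE[alert_name]
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def restart_for_alert(alert_name: str, dry_run: bool = True) -> list[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = restart_command(alert_name)
    if not dry_run:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


print(" ".join(restart_for_alert("web-workers-high-memory")))
# kubectl rollout restart deployment/my-service -n production
```

Passing the command as a list rather than a shell string means alert names never get interpolated into a shell.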

2. Clearing a cache

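Cache poisoning and stale cache entries cause a surprising number of incidents. If your runbook says "flush Redis cache for user sessions," that's a good automation candidate: the command is always the same, and a cache holds data that can be rebuilt from its source. The flush itself can't be undone, but entries repopulate as requests arrive. Know the side effects before you automate it: if the cache is the only place sessions live, flushing it logs users out, and a cold cache pushes extra load onto the backing store.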
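A flush is easier to reason about when it is scoped to one key prefix instead of the whole instance. The sketch below assumes a client that exposes scan_iter(match=...) and delete(*names), which is how the redis-py client behaves; the "session:" prefix is an assumption about how keys are named, and InMemoryCache is only a stand-in so the example runs without a Redis server.

```python
import fnmatch
from itertools import islice


def flush_prefix(client, prefix: str, batch_size: int = 500) -> int:
    """Delete every key starting with `prefix`; return how many were removed."""
    if not prefix:
        raise ValueError("refusing to flush with an empty prefix")
    keys = iter(client.scan_iter(match=f"{prefix}*"))
    deleted = 0
    while True:
        batch = list(islice(keys, batch_size))  # delete in bounded batches
        if not batch:
            break
        deleted += client.delete(*batch)
    return deleted


class InMemoryCache:
    """Stand-in for a Redis client, for illustration only."""

    def __init__(self, data):
        self.data = dict(data)

    def scan_iter(self, match):
        # Snapshot the matching keys so deletes do not disturb iteration.
        return [k for k in list(self.data) if fnmatch.fnmatchcase(k, match)]

    def delete(self, *names):
        return sum(1 for n in names if self.data.pop(n, None) is not None)


cache = InMemoryCache({"session:1": "a", "session:2": "b", "page:home": "c"})
print(flush_prefix(cache, "session:"))  # 2
print(list(cache.data))                 # ['page:home']
```

With a real client, the same function would be called as flush_prefix(redis.Redis(...), "session:"), assuming the prefix contains no glob characters.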

3. Posting a status page update

Updating your status page is critical during an incident, but it's also the kind of task that gets delayed because everyone assumes someone else is doing it. Automating the initial "Investigating" status update when a P1 fires removes the delay and frees the incident commander to focus on mitigation.

Status page automation works best when your alert severity maps directly to your status page incident types. Define that mapping once and the automation handles the rest.
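The mapping can be as small as a lookup table. The severity names, status values, and payload fields below are hypothetical; a real status page provider will have its own field names and allowed values.

```python
from __future__ import annotations

# Hypothetical mapping from alert severity to status page fields.
SEVERITY_TO_STATUS = {
    "P1": {"status": "investigating", "impact": "major"},
    "P2": {"status": "investigating", "impact": "minor"},
}


def initial_status_update(severity: str, service: str) -> dict | None:
    """Build the first public update, or None if this severity stays internal."""
    mapping = SEVERITY_TO_STATUS.get(severity)
    if mapping is None:
        return None
    return {"name": f"Investigating an issue with {service}", **mapping}


print(initial_status_update("P1", "checkout"))
# {'name': 'Investigating an issue with checkout', 'status': 'investigating', 'impact': 'major'}
print(initial_status_update("P4", "checkout"))
# None
```

Returning None for unmapped severities keeps low-priority alerts from posting publicly by default.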

4. Scaling a resource

Autoscaling handles expected load patterns. It doesn't always catch sudden traffic spikes before they cause latency. If your runbook includes a step like "increase web tier to 20 instances," automate it. The risk is a slightly higher cloud bill for a few hours, which is almost always better than user-facing impact.
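To keep the cost risk bounded, an automated scale-up can clamp the requested size to a fixed ceiling. This sketch assumes a Kubernetes deployment scaled with kubectl; the deployment name and the ceiling of 20 are illustrative. If a HorizontalPodAutoscaler manages the same deployment, it may override a manual replica count, so the step may need to adjust the autoscaler's minimum instead.

```python
from __future__ import annotations


def scale_command(deployment: str, namespace: str, requested: int,
                  max_replicas: int = 20) -> list[str]:
    """Build a kubectl scale command, never exceeding max_replicas."""
    if requested < 1:
        raise ValueError("requested replicas must be at least 1")
    replicas = min(requested, max_replicas)  # hard ceiling on spend
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


print(" ".join(scale_command("web", "production", 50)))
# kubectl scale deployment/web --replicas=20 -n production
```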

5. Running a diagnostic health check

Many runbooks start with "verify the issue is what the alert says it is." This verification step (checking database connectivity, pinging a health endpoint, confirming a queue depth) can almost always be automated. Attaching the output directly to the incident can save 5-10 minutes of manual investigation and gives the on-call engineer data before they've even opened their laptop.
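A diagnostic step like this can be a handful of small checks whose results are joined into one block of text for the incident. The sketch below uses only the standard library; the host, port, and URL in CHECKS are placeholders, and how the report gets attached to the incident depends on your on-call platform.

```python
from __future__ import annotations

import socket
import urllib.request
from typing import Callable


def check_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Report whether a TCP connection (e.g. to a database) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return f"failed: {exc}"


def check_http(url: str, timeout: float = 2.0) -> str:
    """Report the status code of a health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return f"failed: {exc}"


def run_diagnostics(checks: dict[str, Callable[[], str]]) -> str:
    """Run every check and return one 'name: result' line per check."""
    lines = []
    for name, check in checks.items():
        try:
            result = check()
        except Exception as exc:  # one broken check must not hide the others
            result = f"error: {exc}"
        lines.append(f"{name}: {result}")
    return "\n".join(lines)


# Placeholder targets; nothing is contacted until run_diagnostics(CHECKS) is called.
CHECKS = {
    "database": lambda: check_tcp("db.internal.example", 5432),
    "health endpoint": lambda: check_http("http://localhost:8080/healthz"),
}
```

Calling run_diagnostics(CHECKS) returns a short text report that can be posted as the first comment on the incident.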

What not to automate yet

There's a category of runbook steps that look like automation candidates but aren't quite there. Be careful with:

Rollbacks. A rollback decision depends on what broke, how bad it is, and whether the previous version actually fixes the problem. That context lives in the engineer's head and in the git log, not in a script. Keep rollbacks a one-click action with strong confirmation prompts, not a fully automatic response.

Anything touching persistent data. That means deleting rows, migrating schemas, or purging queues. Even if the runbook says to do it, the risk profile is too high. A human should confirm before anything destructive runs against a production data store.

Steps with ambiguous triggers. If your automation fires based on an alert, and that alert has a 20% false positive rate, your automation will do the wrong thing 20% of the time. Fix the alert quality first.
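The alert-quality point can be turned into a simple gate that is checked before an automation is enabled. This is a sketch under assumptions: the counts would come from your own incident reviews, and the 5% ceiling is an arbitrary example, not a recommended value.

```python
def false_positive_rate(actionable: int, false_alarms: int) -> float:
    """Share of alert firings that did not need the remediation step."""
    total = actionable + false_alarms
    if total == 0:
        raise ValueError("no alert history to evaluate")
    return false_alarms / total


def ready_to_automate(actionable: int, false_alarms: int,
                      max_rate: float = 0.05) -> bool:
    """True when the measured false positive rate is at or below max_rate."""
    return false_positive_rate(actionable, false_alarms) <= max_rate


print(ready_to_automate(48, 12))  # False: 12 of 60 firings were false alarms (20%)
print(ready_to_automate(58, 2))   # True: 2 of 60 firings were false alarms
```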

Tooling options

You don't need a dedicated automation platform to get started. Teams usually start in one of two places:

Webhooks from your on-call platform. Most alerting tools support outgoing webhooks when an incident is created or escalated. Point those at a small service (or even a Lambda function) that runs the diagnostic or remediation script. This gets you most of the value with minimal infrastructure.

Rundeck or similar. For teams that want a proper job runner with audit logs, approval flows, and a UI for manual one-click execution, Rundeck is a practical choice. You define the job once and use it both for automated triggers and for on-call engineers who want to run a step without SSH access.

For workloads that run on EC2 instances or other managed nodes rather than in Kubernetes, AWS SSM Run Command and similar tools let you execute scripts on instances without giving engineers direct SSH access to production, which reduces the risk of typos and unauthorized changes.
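For the webhook route described above, the receiving service can be very small. The sketch below uses Python's built-in HTTP server; the payload fields (event, alert_name, incident_id) are assumptions, since every on-call platform defines its own webhook schema, and the action function is a placeholder. It also omits authenticating the sender, which a real receiver should do before acting.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

actions_run = []  # stands in for real remediation side effects


def restart_web_workers(payload: dict) -> None:
    # Placeholder: a real handler would run the restart step here.
    actions_run.append(("restart_web_workers", payload.get("incident_id")))


# Hypothetical mapping from alert name to remediation function.
ACTIONS = {"web-workers-high-memory": restart_web_workers}


def handle_event(payload) -> str:
    """Decide what to do with one webhook payload and report the outcome."""
    if not isinstance(payload, dict) or payload.get("event") != "incident.created":
        return "ignored"
    alert = payload.get("alert_name")
    action = ACTIONS.get(alert) if isinstance(alert, str) else None
    if action is None:
        return "no action configured"
    action(payload)
    return f"ran {action.__name__}"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except ValueError:  # covers malformed JSON and bad encodings
            self.send_response(400)
            self.end_headers()
            return
        body = json.dumps({"result": handle_event(payload)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Keeping the decision logic in handle_event, separate from the HTTP handler, means the same function can run unchanged inside a Lambda function or a job runner.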

How to measure if it's working

You want to see two numbers move after automating runbook steps:

Time to first action (TTFA). How long from alert firing to the first remediation step running? Automation should drive this from minutes to seconds for covered scenarios.

Incident duration for the automated scenarios. If you automated the restart step for your web workers OOM alert, does that class of incident resolve faster on average? If not, the restart isn't the real fix, which is useful to know.

Track these per alert type, not as a blended average. A blended MTTR improvement can hide the fact that some automations aren't doing anything.
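Both numbers can be computed per alert type from a plain export of incident timestamps. The records and field names below are made up for illustration, and the sketch uses the median rather than the mean so that one unusually long incident does not dominate a small sample.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Illustrative records; real data would come from your incident tool's export.
INCIDENTS = [
    {"alert_type": "web-oom", "fired_at": "2024-05-01T03:00:00",
     "first_action_at": "2024-05-01T03:00:20", "resolved_at": "2024-05-01T03:12:00"},
    {"alert_type": "db-conn", "fired_at": "2024-05-03T09:00:00",
     "first_action_at": "2024-05-03T09:06:00", "resolved_at": "2024-05-03T09:45:00"},
    {"alert_type": "web-oom", "fired_at": "2024-05-08T14:00:00",
     "first_action_at": "2024-05-08T14:00:40", "resolved_at": "2024-05-08T14:20:00"},
]


def summarize_by_alert_type(incidents):
    """Return median TTFA and duration (in seconds) for each alert type."""
    ttfa = defaultdict(list)
    duration = defaultdict(list)
    for inc in incidents:
        fired = datetime.fromisoformat(inc["fired_at"])
        first_action = datetime.fromisoformat(inc["first_action_at"])
        resolved = datetime.fromisoformat(inc["resolved_at"])
        ttfa[inc["alert_type"]].append((first_action - fired).total_seconds())
        duration[inc["alert_type"]].append((resolved - fired).total_seconds())
    return {
        alert_type: {
            "incidents": len(ttfa[alert_type]),
            "median_ttfa_s": median(ttfa[alert_type]),
            "median_duration_s": median(duration[alert_type]),
        }
        for alert_type in ttfa
    }


for alert_type, stats in summarize_by_alert_type(INCIDENTS).items():
    print(alert_type, stats)
# web-oom {'incidents': 2, 'median_ttfa_s': 30.0, 'median_duration_s': 960.0}
# db-conn {'incidents': 1, 'median_ttfa_s': 360.0, 'median_duration_s': 2700.0}
```

Comparing these figures for the same alert type before and after an automation ships is what shows whether that automation is doing anything.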

Start with one alert

Pick the alert that fires most often and has the clearest remediation step. Automate just that one step. Run it for a month and see whether incidents are shorter, whether the on-call team trusts it, and whether it generates false positive actions.

That's a more useful signal than automating ten things at once and trying to figure out which one helped.

The goal isn't a fully automated NOC. It's getting the first five minutes of an incident to be faster and less stressful so your engineers can focus on the parts that actually need them.