
Toil in SRE: How to Identify It, Measure It, and Reduce It
Most on-call engineers know the feeling. It's 2am, you've been paged for the fourth time this week about the same thing. You restart the service, the alert resolves, and you go back to sleep knowing you'll probably do it again Thursday. That's toil.
Toil is one of the most damaging things you can ignore in an engineering org. It doesn't show up in incident severity reports. It doesn't get tracked in postmortems. But it compounds quietly, grinding down the people who keep your systems running.
What toil actually is
The term comes from Google's SRE book, but the definition is worth internalizing properly. Toil is work that is:
- Manual — a human has to do it, even though a machine could
- Repetitive — you've done the exact same thing before, probably many times
- Automatable — given time and effort, it could be scripted or eliminated
- Reactive — it comes at you instead of you choosing to do it
- Devoid of lasting value — once done, the system isn't better off; you've just maintained the status quo
The last point is the crucial one. Fixing a real bug or building a feature has lasting value. Restarting a pod because a memory leak hasn't been patched yet does not.
Here's what toil looks like in practice for on-call engineers:
- Manually restarting a service that crashes every few days
- Acknowledging and resolving alerts that are known to be noisy but haven't been cleaned up
- Running a deploy script by hand because the CI pipeline doesn't support that environment yet
- Copying metrics from a dashboard into a status update email every hour during an incident
- Looking up the same runbook steps every time a particular alert fires, because nothing is automated
None of these are "real" incidents. They're friction. But each one takes time, breaks focus, and pulls engineers away from work that actually matters.
How to measure how much toil your team has
You can't reduce what you don't track. The simplest approach is to ask engineers to log their interrupt work during on-call shifts.
A basic toil log looks like this:
| Date | Alert/Task | Time Spent | Could be automated? | Recurrence |
|---|---|---|---|---|
| Apr 8 | Redis OOM restart | 15 min | Yes | ~2x/week |
| Apr 9 | Manual DB snapshot before deploy | 20 min | Yes | Every deploy |
| Apr 10 | Updated status page manually | 30 min | Partial | Every major incident |
| Apr 11 | Acked flapping disk alert | 5 min | Yes | Daily |
After two or three on-call rotations, patterns emerge fast. You'll see the same services, the same alerts, the same manual steps showing up repeatedly.
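If the log is kept as a plain CSV file, even a short awk one-liner can surface the worst offenders. A minimal sketch, assuming a hypothetical `toil-log.csv` with `date,task,minutes` columns:

```shell
#!/bin/bash
# Sum time spent per toil task from a CSV log and rank by total minutes.
# File name and column layout (date,task,minutes) are illustrative.
cat > toil-log.csv <<'EOF'
date,task,minutes
Apr 8,Redis OOM restart,15
Apr 9,Manual DB snapshot,20
Apr 11,Acked flapping disk alert,5
Apr 12,Redis OOM restart,15
EOF

# Total minutes per task, biggest time sink first
awk -F, 'NR > 1 { total[$2] += $3 } END { for (t in total) printf "%s: %d min\n", t, total[t] }' toil-log.csv | sort -t: -k2 -rn
```

The task at the top of that output is usually the first automation candidate.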
A rough benchmark from Google's SRE practices: if more than 50% of an on-call engineer's time is going to toil rather than project work, that's a problem. Not a "worth mentioning" problem, a "someone needs to own this" problem.
You can also get quantitative by looking at your alert data. How many alerts resolved themselves without action? How many pages resulted in the same fix being applied? If your incident management tool tracks resolution notes, you can often mine those for patterns.
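The mining itself can start very simple. A rough sketch, assuming you can export pages as tab-separated lines of alert name and resolution note (the file and its layout are hypothetical):

```shell
#!/bin/bash
# Count repeated alert/resolution pairs from an exported page log.
# pages.tsv and its two-column layout are illustrative, not a real tool's output.
cat > pages.tsv <<'EOF'
redis-oom	restarted redis
disk-flap	no action, self-resolved
redis-oom	restarted redis
disk-flap	no action, self-resolved
disk-flap	no action, self-resolved
EOF

# Most frequent pairs first: identical resolutions recurring at the top
# of this list are strong automation (or alert-tuning) candidates.
sort pages.tsv | uniq -c | sort -rn
```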
The biggest sources of toil in incident response
Once you start looking, toil shows up everywhere. A few of the most common culprits:
Noisy alerts that don't need human attention. An alert that pages an engineer but consistently resolves on its own within five minutes is pure toil. It's not catching real incidents; it's just interrupting sleep. These need to be tuned, silenced, or replaced with a smarter detection rule.
Manual runbook steps that could be scripted. Runbooks are valuable, but a runbook that says "SSH into host-03 and run sudo systemctl restart app" is a toil machine. Every time that alert fires, someone loses 15 minutes. If a step is always the same and always works, automate it.
Repetitive escalation for known issues. If the same class of alert always gets escalated to the same person who applies the same fix, you have a process problem. Either that engineer documents the fix and hands it off, or the fix gets automated.
Status updates done by hand. During incidents, keeping stakeholders informed is important. Copying and pasting metrics into Slack every 30 minutes is not a good use of an engineer's time during an active outage. This should be templated at minimum, automated where possible.
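Even the templated version can be mostly scripted. A sketch of building a status message from variables rather than hand-writing it; the incident name, error rate, and webhook URL are placeholders, and the posting step assumes a Slack incoming webhook:

```shell
#!/bin/bash
# Build a templated incident status update from variables.
# INCIDENT and ERROR_RATE are placeholder values for illustration.
INCIDENT="checkout-latency"
ERROR_RATE="2.3%"

payload=$(printf '{"text": "Status update for %s: error rate %s, next update in 30 min"}' "$INCIDENT" "$ERROR_RATE")
echo "$payload"

# To post it to a Slack incoming webhook (URL is a placeholder):
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
```

Pulling the error rate from a metrics API instead of a variable turns this from a template into full automation.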
How to actually reduce it
Reducing toil isn't about spending a week rewriting everything. It's about consistent, small bets.
Start with the highest-frequency items. Look at your toil log and find the thing you did the most times last quarter. Fix that one first. Even shaving five minutes off something that happens 40 times a month is 200 minutes back.
Automate remediation for known failure patterns. For services with well-understood failure modes, you can run auto-remediation scripts that trigger on specific alerts. Here's a minimal example using a webhook-triggered script:
```shell
#!/bin/bash
# auto-remediate.sh
# Triggered when the redis-oom alert fires; restarts Redis on the
# affected host and verifies recovery before deciding whether to escalate.
SERVICE="redis"
HOST="$1"

if [ -z "$HOST" ]; then
  echo "Usage: $0 <host>" >&2
  exit 2
fi

echo "Restarting $SERVICE on $HOST"
ssh "$HOST" "sudo systemctl restart $SERVICE"

# Verify recovery
sleep 10
STATUS=$(ssh "$HOST" "systemctl is-active $SERVICE")
if [ "$STATUS" = "active" ]; then
  echo "Recovery successful"
  exit 0
else
  echo "Recovery failed, escalating"
  exit 1
fi
```
If the script fails, it exits non-zero and triggers escalation. If it succeeds, no one gets paged at all.
Auto-remediation is powerful but needs guardrails. Only automate remediations you fully understand, test them in staging first, and log every automated action so you can audit what happened during incidents.
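One way to get the audit log almost for free is to route every automated action through a small wrapper that records a timestamped line before executing it. A sketch; the `run_logged` helper and log path are my own names, not a standard tool:

```shell
#!/bin/bash
# Log every automated remediation action before running it, so automated
# changes are auditable after an incident. Helper name and log path are
# illustrative choices.
LOG_FILE="./remediation-audit.log"

run_logged() {
  printf '%s remediation: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG_FILE"
  "$@"
}

# Example: the restart command goes through the wrapper (dry run here)
run_logged echo "systemctl restart redis (dry run)"
cat "$LOG_FILE"
```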
Fix the alert, not just the symptom. When you find yourself resolving the same alert repeatedly, ask whether the alert should exist at all in its current form. Maybe the threshold is wrong. Maybe it's checking the wrong thing. Maybe the underlying issue should be fixed, which would make the alert unnecessary.
Use maintenance windows instead of reactive restarts. If a service needs to be restarted every few days because of a known memory leak, schedule a maintenance window for off-peak hours until the leak is patched. Stop letting it interrupt whoever's on call.
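Scheduling that restart can be a one-line change. As a sketch, a crontab entry for an off-peak restart; the 4am slot and the `app` service name are placeholders:

```
# Restart the leaky service at 04:00 daily, off-peak, until the leak is patched
# (crontab entry; time and service name are placeholders)
0 4 * * * systemctl restart app && logger "scheduled restart of app"
```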
Make toil reduction part of your project work. If your team only ever does on-call duties and feature work, toil never gets addressed. Block time each sprint explicitly for reliability improvements. Even two hours per engineer per week compounds significantly over a quarter.
The toil budget
Google SRE recommends capping toil at 50% of on-call engineer time. In practice, even that ceiling is high. If you're aiming for a healthy on-call culture, 25-30% is a more realistic target.
Tracking this doesn't have to be complex. At the end of each rotation, ask engineers to estimate what percentage of their time went to interrupt-driven manual work vs. actual project or improvement work. Plot it over time.
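Turning those estimates into a trend line is a few lines of shell. A sketch, assuming each rotation's estimate lands in a simple CSV (the file name and columns are my own):

```shell
#!/bin/bash
# Compute toil percentage per on-call rotation from logged estimates.
# rotations.csv and its columns (rotation,toil_hours,total_hours) are illustrative.
cat > rotations.csv <<'EOF'
rotation,toil_hours,total_hours
2024-W10,18,40
2024-W12,14,40
2024-W14,10,40
EOF

awk -F, 'NR > 1 { printf "%s: %.0f%% toil\n", $1, 100 * $2 / $3 }' rotations.csv
```

A downward-trending series here is exactly the kind of evidence worth putting in front of leadership.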
If the toil percentage is trending up, that's a signal the team is running to stand still. Systems are getting more complex faster than the team can automate and improve. Something needs to change, whether that's headcount, prioritization, or a deliberate investment in reliability work.
If it's trending down, that's a meaningful sign that toil reduction efforts are paying off. Those improvements are often invisible to product teams and leadership because nothing broke. Make them visible by tracking and reporting the metric.
Toil rarely gets fixed on its own. It takes someone deciding it matters enough to schedule the work. The engineers who do that work are the ones who keep on-call sustainable for their teams long term.