
Roll Back or Roll Forward? Making the Call During a Deploy Incident
A deploy went out twenty minutes ago. Errors are spiking. Latency is up. Your first instinct is to roll back, but someone on the channel says a fix is already staged and could be out in five minutes.
This is one of the most common pressure points in incident response, and it's the kind of decision that gets made badly when teams haven't thought it through in advance. Here's a framework for making the call quickly and clearly.
First: Confirm the deploy is actually the cause
Don't assume. Correlation isn't causation, and declaring a deploy the culprit before you've verified it wastes time if something else is wrong.
Check these things before picking a path:
- Deploy timing vs. error onset. Did errors start within a few minutes of the deploy finishing? If there's a 40-minute gap, the deploy is less likely to be the direct cause.
- Which services degraded. If the deploy touched Service A but errors are in Service B, look at dependencies. Was Service B calling something Service A changed?
- What changed. A one-line config change and a 200-file refactor carry very different risk profiles. Look at the diff.
- Is this reproducible? Can you hit the error in staging or with a test request? If yes, the cause is more likely code. If no, think about infrastructure, traffic patterns, or data.
If you can't confirm the deploy is the cause within five minutes, keep investigating while someone prepares the rollback option so it's ready if needed.
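The timing check in the first bullet is easy to express in code. The sketch below is a minimal illustration, not a complete detector: the function name `deploy_is_suspect` and the 10-minute window are assumptions chosen for the example, and you would tune the window to your own rollout and alerting delays.

```python
from datetime import datetime, timedelta


def deploy_is_suspect(deploy_finished, error_onset, window=timedelta(minutes=10)):
    """Return True if errors began within `window` after the deploy finished.

    The 10-minute default is an illustrative assumption, not a standard.
    Errors that started before the deploy finished return False here; for a
    slow rolling deploy you may want to pass the deploy start time instead.
    """
    gap = error_onset - deploy_finished
    return timedelta(0) <= gap <= window


finished = datetime(2024, 1, 1, 12, 0)
print(deploy_is_suspect(finished, finished + timedelta(minutes=3)))   # errors 3 minutes later
print(deploy_is_suspect(finished, finished + timedelta(minutes=40)))  # the 40-minute gap case
```

A match only raises suspicion. The other checks in the list (affected services, the diff, reproducibility) still apply.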
The two options
Roll back: Revert to the previous known-good version. The incident stops (assuming the deploy is the cause), but you've temporarily lost whatever the deploy was supposed to deliver.
Roll forward: Push a fix on top of the current broken version. The incident resolves (assuming the fix works), and the original changes stay in place.
Neither option is always right. The choice depends on what broke, how bad it is, and how quickly a fix is available.
When to roll back
Roll back when:
- The cause is unclear and the impact is high. If you know the deploy is correlated with the incident but don't know exactly why, and users are being affected right now, rolling back buys you time to diagnose safely. You can redeploy once you understand the problem.
- The fix isn't ready. If someone says "I think I know the fix" but it hasn't been written and tested yet, that estimate is probably optimistic under pressure. Rolling back while the fix gets properly built is almost always faster overall.
- The change is easy to revert. If your deploy system supports one-click rollback and your service is stateless, this is low risk and fast.
- The deploy introduced a regression in a critical path. Checkout broken, auth broken, data writes failing. These are cases where a partial fix-forward approach creates its own risk.
Rolling back a database migration is a special case. If the deploy included a schema change, check whether the previous version of the application is compatible with the current schema before you revert. In many cases, rolling back the application code is safe but rolling back the migration is not.
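One part of that compatibility check can be sketched as a set comparison: does every column the previous application version depends on still exist in the current schema? The example below is a simplified illustration. The function name and the dictionary shapes are hypothetical, and it only catches dropped or renamed columns. It does not catch other incompatibilities, such as a new required column that the old code never writes.

```python
def missing_columns(required_by_old_version, current_schema):
    """Return {table: columns} that the old code needs but the schema lacks.

    Both arguments map a table name to a set of column names. An empty
    result means this particular check found no problem; it does not prove
    that reverting the application is safe.
    """
    missing = {}
    for table, columns in required_by_old_version.items():
        absent = set(columns) - set(current_schema.get(table, set()))
        if absent:
            missing[table] = absent
    return missing


# Hypothetical example: the migration renamed legacy_status to status.
old_needs = {"orders": {"id", "total", "legacy_status"}}
schema_now = {"orders": {"id", "total", "status"}}
print(missing_columns(old_needs, schema_now))
```

If the result is non-empty, reverting the application code alone would leave the old version reading columns that no longer exist.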
When to roll forward
Roll forward when:
- The fix is already written, tested, and staged. "Five minutes away" only counts if the code exists and someone has actually verified it works. If that's true, rolling forward is reasonable.
- Rolling back is dangerous. Some deploys are hard to reverse safely. If the deploy included a migration that can't be cleanly undone, or a data transform that already ran, rolling back the code without rolling back the data may make things worse.
- The impact is low and contained. A bug affecting 2% of users in a non-critical flow doesn't justify the operational overhead of a rollback if a clean fix is minutes away.
- The rollback itself has known risks. If the previous version had a bug you just fixed in a different area, reverting to it trades one problem for another.
The decision checklist
When you're in the middle of the incident and need to pick a path, run through this:
| Question | Roll back | Roll forward |
|---|---|---|
| Is the impact high (data loss, full outage, payments)? | Yes | No |
| Is a tested fix ready right now? | No | Yes |
| Is the rollback clean (no migrations, stateless)? | Yes | No |
| Is the cause confirmed? | Doesn't matter | Prefer yes |
| Can the team afford to wait for a fix? | No | Yes |
If the answers point in different directions, bias toward rolling back. The cost of a rollback is usually a few minutes of deploy time and a delayed feature. The cost of a failed roll-forward is a longer incident.
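The checklist and its tie-break rule can be written down as a small function, which is one way to make the default explicit before an incident. This is a sketch under stated assumptions: the names `IncidentState` and `recommend` are invented for the example, and it treats "prefer yes" on a confirmed cause as a hard requirement for rolling forward, which is stricter than the table.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    impact_high: bool        # data loss, full outage, payments
    tested_fix_ready: bool   # the fix exists and has been verified
    rollback_clean: bool     # no migrations, stateless service
    cause_confirmed: bool
    can_wait_for_fix: bool


def recommend(state):
    """Return "roll forward" only when every answer points that way.

    Any disagreement falls back to "roll back", mirroring the bias above.
    """
    forward_signals = [
        not state.impact_high,
        state.tested_fix_ready,
        not state.rollback_clean,
        state.can_wait_for_fix,
    ]
    if all(forward_signals) and state.cause_confirmed:
        return "roll forward"
    return "roll back"


print(recommend(IncidentState(False, True, False, True, True)))  # all answers favor the fix
print(recommend(IncidentState(True, True, False, True, True)))   # high impact breaks the tie
```

A "roll back" result with `rollback_clean=False` still needs a human to check the migration concerns described earlier. The function encodes the default, not the judgment.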
Executing the rollback
Once you've decided to roll back:
- Announce it in the incident channel. "Rolling back to v1.4.2, ETA 3 minutes." This keeps everyone oriented and prevents parallel actions.
- Use your deploy system's native rollback, not a manual revert. Most CI/CD tools have a rollback button or command that redeploys the last successful artifact. This is faster and less error-prone than creating a new revert commit under pressure.
- Watch the deploy complete. Don't declare success until the rollback has fully deployed and your error rates are recovering. Partial rollbacks happen.
- Verify recovery. Once errors are down, confirm with a test request or by checking your dashboards. The incident isn't resolved until you've seen evidence the service is healthy, not just that the deploy finished.
- Set a follow-up action. Before you close the incident, write down what needs to happen before the original change ships again. This usually means a task in your issue tracker with the broken change and the required fix.
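The "watch the deploy complete" and verification steps amount to polling a health signal until it stays good for a while. The sketch below assumes you can supply a function that returns the current error rate as a fraction; `fetch_error_rate`, the 1% threshold, and the sample counts are all hypothetical placeholders rather than recommendations.

```python
import time


def wait_for_recovery(fetch_error_rate, threshold=0.01, healthy_samples=3,
                      max_samples=30, interval_seconds=10.0, sleep=time.sleep):
    """Poll until the error rate stays at or below `threshold` for
    `healthy_samples` consecutive readings. Return False if that never
    happens within `max_samples` readings.
    """
    streak = 0
    for i in range(max_samples):
        rate = fetch_error_rate()
        # A single bad reading resets the streak, so a brief dip does not count.
        streak = streak + 1 if rate <= threshold else 0
        if streak >= healthy_samples:
            return True
        if i < max_samples - 1:
            sleep(interval_seconds)
    return False


# Simulated readings standing in for a real metrics query.
readings = iter([0.20, 0.05, 0.005, 0.004, 0.003])
recovered = wait_for_recovery(lambda: next(readings), sleep=lambda s: None)
print(recovered)
```

Requiring several consecutive healthy readings is a guard against the partial-rollback case, where the error rate drops briefly and then climbs again.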
After the rollback: don't skip the investigation
The rollback stops the bleeding, but the root cause work matters. Without it, the same deploy (or a similar one) causes the same incident again.
Write down:
- What the deploy changed
- Which part of that change caused the incident
- Why the issue wasn't caught in staging or review
- What would catch it next time
This doesn't have to be a full postmortem if the incident was small. A few bullet points in your incident notes are enough. The point is to create a record so the team can act on it.
Build the decision into your runbooks
If your team regularly ships code, this decision will come up again. The worst time to build your framework is during an active incident when someone is asking "do we roll back?" and nobody has a clear answer.
Write a short runbook that documents your team's defaults:
- How to trigger a rollback in your deploy system
- Which signal confirms the deploy is the cause
- Who makes the final call (incident commander, on-call engineer, team lead)
- What constitutes "fix is ready" for roll-forward purposes
Having that written down, even as a short checklist, takes one decision off the plate when things are already stressful.
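If your runbooks live alongside code, the four defaults can be kept as a small structured record so that a blank entry is easy to spot. This is only one possible shape: the class name, field names, and sample values below are hypothetical and do not refer to any particular deploy tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RollbackRunbook:
    rollback_trigger: str      # how to trigger a rollback in your deploy system
    cause_signal: str          # which signal confirms the deploy is the cause
    decision_owner: str        # who makes the final call
    fix_ready_definition: str  # what "fix is ready" means for roll-forward


def missing_entries(runbook):
    """Return the names of fields that were left blank."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name).strip()]


# Placeholder values for illustration only.
runbook = RollbackRunbook(
    rollback_trigger="Redeploy the last successful artifact from the deploy dashboard",
    cause_signal="",
    decision_owner="Incident commander",
    fix_ready_definition="Merged, tests passing, verified in staging",
)
print(missing_entries(runbook))
```

A check like this could run in CI so that an incomplete runbook is caught before the next incident rather than during it.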
The fastest resolution to a deploy incident isn't always the rollback and it isn't always the hotfix. It's having a clear process so you spend your time fixing the problem, not arguing about which path to take.