Roll Back or Roll Forward? Making the Call During a Deploy Incident
A practical framework for deciding whether to revert a bad deploy or push a hotfix during an active incident.
Product updates, on-call war stories, and what we've learned building alert routing from scratch.
38 posts
Most alert messages waste the first five minutes of an incident. Here's how to fix them.
Cascading failures turn minor issues into full outages. Learn to recognize the patterns, stop the spread, and design systems that fail more gracefully.
A practical guide to preparing engineers for on-call, covering runbooks, shadow shifts, escalation paths, and readiness checks.
How to use feature flags as emergency kill switches during incidents, with implementation patterns and on-call workflow tips.
A practical guide to planning and running incident simulation exercises that build team confidence and surface real gaps in your on-call process.
A practical guide to creating and maintaining a service catalog that cuts confusion during incidents and keeps ownership clear.
A practical guide to multi-window, multi-burn-rate alerting that pages you when it matters and stays quiet when it doesn't.
Toil is the silent killer of on-call morale. Here's how to spot it, quantify it, and actually get rid of it.