The First 10 Minutes of an Incident: A Triage Checklist
What to do when you get paged at 2am. A practical triage checklist for the critical first minutes of any incident.
Product updates, on-call war stories, and what we've learned building alert routing from scratch.
28 posts
How to use chat platforms like Slack to coordinate incidents without turning your #incidents channel into chaos.
What synthetic monitoring is, how it differs from real-user monitoring, and how to set it up to detect outages proactively.
A practical guide to writing Service Level Objectives that drive real decisions, not just sit in a doc nobody reads.
Error budgets translate abstract SLO targets into concrete on-call decisions. Here's how to calculate them and actually use them.
A practical guide to incident status pages: what to say, when to say it, and how to avoid the common mistakes that erode customer trust.
A practical guide to round-robin, primary/secondary, follow-the-sun, and squad-based on-call rotations, with guidance on when to use each.
What an incident commander actually does, when to assign one, and how to train your team for structured incident response.
Most runbooks rot in a wiki nobody reads. Here's how to write ones that genuinely help engineers resolve incidents faster.