
Game Days: How to Run Incident Simulations That Actually Help
Most teams find out their runbooks are broken during a real outage at 2am. That's a terrible time to discover that the person listed as the database escalation contact left six months ago.
Game days are how you find those gaps before they cost you. The concept is simple: deliberately simulate an incident so your team practices responding to one. The execution is where teams tend to go wrong.
What a game day actually is
A game day is a structured exercise where you inject a failure (real or simulated) into a system and have your on-call team respond as if it were a live incident. It's not a demo and it's not a tabletop talk-through, though tabletops have their own value.
The defining feature is that the people responding don't know exactly what's broken. The responder has to triage, investigate, and resolve the issue just like they would at 3am on a Saturday.
There are three common formats:
| Format | How it works | Best for |
|---|---|---|
| Tabletop | Walk through a scenario verbally, no systems touched | New teams, testing processes and communication |
| Simulated alert | Inject a fake alert into your alerting system, respond normally | Testing paging, escalation, and runbook quality |
| Live chaos injection | Use a chaos tool (or manual action) to break something real | Teams with production-safe chaos infrastructure |
Most teams should start with tabletops and simulated alerts before touching production.
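For the simulated-alert format, the injection itself can be a small script that posts a synthetic event to whatever ingests alerts in your stack. Here's a minimal sketch in Python, assuming a hypothetical webhook URL and payload shape; swap in your alerting tool's real ingestion API (PagerDuty, Opsgenie, Alertmanager, and similar tools all have one):

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint -- replace with your alerting tool's ingestion API.
ALERT_WEBHOOK = "https://alerts.example.com/v1/events"


def inject_drill_alert(summary: str, service: str) -> None:
    """Post a synthetic alert, clearly labeled as a drill."""
    event = {
        # The prefix ensures nobody mistakes this for a real incident.
        "summary": f"[GAME DAY DRILL] {summary}",
        "service": service,
        "severity": "critical",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # A label like this lets you filter drill events out of incident
        # metrics and dashboards afterward.
        "labels": {"drill": "true"},
    }
    request = urllib.request.Request(
        ALERT_WEBHOOK,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        print(f"Drill alert injected, status {response.status}")


if __name__ == "__main__":
    inject_drill_alert(
        summary="Primary database has gone read-only",
        service="orders-db",
    )
```

The details that matter are the drill prefix and the label, not the exact payload: responders should be able to respond exactly as they would to a real page, while anyone looking at incident metrics later can exclude the drill.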
Why bother if things are mostly fine
A few reasons that aren't obvious until something goes wrong:
Runbooks rot. You write a runbook in January. By April, two services have been renamed, one dependency was swapped out, and the dashboard link 404s. Nobody updates it because nobody has a reason to read it until there's an incident. Game days give you a reason to read runbooks under pressure, which is when the gaps show up.
New team members haven't seen a bad incident. If someone joined after your last major outage, they've never used your incident tooling under stress. Their first time shouldn't be during a real P1.
Communication patterns break under pressure. Teams that coordinate fine during normal work can get into trouble during incidents. Who's updating the status page? Who's talking to the customer team? Game days surface coordination problems before they matter.
Your alerting probably has gaps. When you simulate an incident and your alert doesn't fire, that's useful information. When it fires three minutes later than you expected, that's also useful.
How to plan one
Pick a scenario that's realistic
Don't invent a space-alien scenario. Use something that either happened before or plausibly could: a database going read-only, a cache layer falling over and causing latency spikes, a third-party API returning 500s, a deploy that accidentally doubled memory usage.
The best source of scenarios is your post-mortems. If something happened once, it can happen again. Running a drill based on a real previous incident also lets you validate whether the fixes you implemented actually work.
Decide what's in scope
Be explicit about blast radius before you start. For a tabletop, it's zero. For a simulated alert, you're touching your alerting and notification systems. For live chaos injection, you need to agree on which environments or services are fair game.
Some teams run game days in staging. That's fine for testing runbooks, but staging environments often behave differently from production, so you might not catch real latency or load-dependent issues. Others run them in production during low-traffic windows with feature flags or kill switches ready. Whichever you choose, make the choice deliberately and agree on it before you start.
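One way to make that agreement concrete is to write the scope down as a short brief everyone signs off on before the drill. A minimal sketch, with hypothetical field names rather than any standard format, and illustrative values throughout:

```python
from dataclasses import dataclass, field


@dataclass
class GameDayScope:
    """Hypothetical pre-drill brief: what's fair game and when to stop."""
    environment: str                    # "staging" or "production"
    services_in_scope: list[str]
    services_off_limits: list[str]
    traffic_window: str                 # agreed low-traffic window
    rollback: str                       # how to undo the injected failure
    abort_if: list[str] = field(default_factory=list)  # stop the drill at once if any of these happen


scope = GameDayScope(
    environment="production",
    services_in_scope=["checkout-cache"],
    services_off_limits=["payments", "auth"],
    traffic_window="Tuesday 10:00-11:00 UTC",
    rollback="disable the cache_chaos feature flag",
    abort_if=[
        "checkout error rate exceeds the agreed threshold",
        "any real (non-drill) alert fires",
    ],
)
```

The abort conditions are worth agreeing on in advance; deciding mid-drill whether something is part of the exercise or a real problem is exactly the confusion you're trying to avoid.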
Have a game master
One person coordinates the exercise and knows what the injected failure is. Everyone else responds normally. The game master watches what happens, takes notes, and can inject complications if the team is finding things too easy (or call it early if something's going wrong in an unexpected way).
The game master isn't the incident commander. They're outside the exercise, observing.
Timebox it
Set a hard end time. If it's a tabletop, 90 minutes is usually enough. If it's a live drill, 45-60 minutes works for most scenarios. Running over signals the scenario was too complex or the team got genuinely stuck, which is itself useful information.
Running the exercise
Start the drill like you'd start a real incident: fire the alert (or announce the scenario), and let the team respond. Don't coach during the exercise. Let people get confused, look things up, and make mistakes. That's the point.
The game master watches for the following (a small note-taking sketch comes after the list):
- How long it takes to acknowledge and start investigating
- Whether the right people get paged or notified
- Whether anyone checks the runbook, and what happens when they do
- Whether the team communicates about what they're trying and what they're finding
- Where people get stuck
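Those observations are easier to compare across drills if they're timestamped as they happen. A minimal sketch of a note log the game master might keep, with illustrative event names:

```python
import csv
import time


class DrillLog:
    """Timestamped notes the game master keeps during the exercise."""

    def __init__(self, scenario: str) -> None:
        self.scenario = scenario
        self._start = time.monotonic()
        self._events: list[tuple[str, float, str]] = []

    def note(self, event: str, detail: str = "") -> None:
        # Record minutes elapsed since the drill started.
        elapsed = (time.monotonic() - self._start) / 60
        self._events.append((event, round(elapsed, 1), detail))

    def save(self, path: str) -> None:
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["event", "minutes_elapsed", "detail"])
            writer.writerows(self._events)


# Usage during a drill -- the event names are examples, not a standard.
log = DrillLog("database goes read-only")
log.note("alert_fired")
log.note("acknowledged", "on-call acked the page")
log.note("runbook_opened", "dashboard link in the runbook 404s")
log.note("mitigation_started", "failing over to the replica")
log.note("resolved")
log.save("drill-notes.csv")
```

Minutes-since-start is usually enough resolution; the point is to be able to say in the debrief how long acknowledgment took versus how long it took to find a usable runbook.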
Don't intervene to help the team during the drill unless something is genuinely going wrong in a way that could cause real harm. Getting stuck is part of the exercise.
After the resolution (or after time runs out), do a short debrief while it's fresh. What did the team notice? What was harder than expected? What didn't exist and should?
The debrief is where the value is
A game day without a debrief is just a stressful exercise. The debrief is how you turn observations into improvements.
Keep it blameless, same as a post-mortem. The goal is to understand what the system (including processes and tooling) made difficult, not to evaluate individual performance.
Some questions that help:
- What was the first thing you checked? Why? Did it help?
- Was there a point where you weren't sure what to do next?
- Which documentation did you reach for? Was it accurate?
- Was there something you needed that didn't exist?
- What would have made this 50% faster?
Document what comes out of this. Broken runbook links, missing escalation paths, alerts that fired late, communication gaps. These become action items, same as any other incident.
Common mistakes
Making the scenario too hard. If the team can't find or fix the issue, you don't learn much except that the scenario was unrealistic. Start simple.
Not running them regularly. A single game day every two years isn't useful. Teams change, systems change. Once per quarter is a reasonable starting cadence for a team that's never done them. Twice a year is fine once the practice is established.
Only drilling the happy path. Most teams practice the failure mode but not the communication breakdown. What happens if your incident bridge drops? What happens if the person with the database credentials is unreachable? Those scenarios are worth running too.
Treating it as a test. If people feel like they're being evaluated, they'll play it safe and you won't see the real gaps. Frame it explicitly as a learning exercise, not a performance review.
Skipping the action items. If the debrief produces a list of five things that need fixing and nothing gets fixed, the next game day will find the same problems. That's demoralizing. Assign owners, put the items in your backlog, and check on them.
A simple starting point
If you've never run a game day, here's a low-effort first one (a small scenario-brief sketch follows the steps):
- Pick a past incident from your post-mortems
- Reconstruct the initial symptom (what alert fired, what the dashboard looked like)
- Run a 60-minute tabletop where one person plays the incident commander and others play supporting roles
- Give people the original alert context but not the post-mortem
- Work through what they'd do, step by step
- At the end, share the post-mortem and compare
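The preparation boils down to splitting the old incident into what responders get to see and what stays with the game master. A minimal sketch, with entirely illustrative content:

```python
# Hypothetical brief the game master prepares for a first tabletop,
# reconstructed from an old post-mortem. All values are illustrative.
scenario_brief = {
    # Shared with responders at the start of the drill.
    "give_to_responders": {
        "alert": "p95 latency on /checkout above 2s for 10 minutes",
        "dashboard_state": "latency climbing, error rate flat, CPU normal",
        "time_of_day": "Saturday, 03:00",
    },
    # Held back by the game master until the debrief.
    "keep_hidden": {
        "root_cause": "cache node evicted; all reads hitting the database",
        "what_actually_fixed_it": "restart the cache and warm it manually",
        "how_long_it_took_the_first_time": "see the original post-mortem",
    },
}
```

Comparing what the tabletop team did against the hidden half at the end is the payoff of reusing a real incident.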
You'll learn something. Usually several things.
Game days feel like overhead until you've seen one surface a critical gap before a real incident does. Then they feel like cheap insurance.