
Responding to Database Incidents: A Practical Guide
Database incidents escalate fast. A slow query becomes a connection pool issue becomes a cascading failure across every service that depends on that database. The blast radius is wide, the pressure is high, and the wrong move at the wrong moment can make things significantly worse.
This is a guide for the engineer who just got paged for something database-related.
Before You Do Anything
Your first job is to understand the scope, not fix it. Resist the urge to restart a service or kill connections before you know what you're looking at.
Start with three questions:
- Is the database accepting connections? A complete outage is a different situation than a degraded one.
- Which services are affected? Check your service dependency map or run a quick query on your monitoring tool.
- When did it start? Cross-reference with recent deploys, cron jobs, or traffic spikes.
Only after you have answers should you start taking action.
Common Failure Modes and Their Signatures
Different symptoms point to different root causes. Here's a quick reference:
| Symptom | Likely Cause | First Check |
|---|---|---|
| Queries timing out | Long-running query, lock contention | SHOW PROCESSLIST (MySQL) or pg_stat_activity (Postgres) |
| High connection count | Connection pool exhaustion | Active connections vs. configured max |
| Slow reads, writes normal | Replication lag | Replica lag metrics, seconds_behind_master |
| Intermittent query slowdowns | Cache eviction, missing index, plan regression | Slow query log, EXPLAIN on slow queries |
| Disk I/O spikes | Large batch job, full table scan, WAL writes | Disk metrics, check for sequential scans |
| Database won't start | Corrupted data file, out of disk space | Disk usage, error logs |
Read-Only Investigation First
If you're on Postgres, these queries are your starting point. All of them are read-only.
-- Queries running longer than 5 seconds
SELECT
pid,
now() - query_start AS duration,
state,
query
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 seconds'
AND state != 'idle'
ORDER BY duration DESC;
-- Find blocked queries and what's blocking them
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
-- Replication lag (run on the primary)
SELECT
client_addr,
state,
(sent_lsn - replay_lsn) AS replication_delay_bytes
FROM pg_stat_replication;
For MySQL, the equivalent starting points are SHOW PROCESSLIST;, SHOW ENGINE INNODB STATUS;, and checking Seconds_Behind_Master on replicas.
When to Kill a Query
Killing a long-running query is sometimes the right call, but not always. Before you do it:
- Confirm the query is actually causing the problem, not just a symptom.
- Check whether it's a write. Killing a long
UPDATEorDELETEtriggers a rollback, which can take as long as the transaction ran. - Check who owns it. A long-running query from your analytics pipeline is different from one holding a lock that's blocking all application writes.
If you decide to terminate in Postgres:
-- Try this first: sends an interrupt signal
SELECT pg_cancel_backend(<pid>);
-- If that doesn't work: force-terminates
SELECT pg_terminate_backend(<pid>);
pg_cancel_backend waits for the query to stop cleanly. pg_terminate_backend is more forceful. Start with cancel, and escalate to terminate only if the query doesn't respond within 30 seconds.
Connection Pool Exhaustion
When your app can't connect to the database, the most common cause isn't the database being down. It's running out of connections.
Check where you stand:
-- Postgres
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;
-- See what's holding connections (look for 'idle' with long connection durations)
SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;
If you're at or near the limit, look for idle connections that have been open a long time without doing work. These are usually connection leaks in the application layer.
Short-term mitigation: if you run PgBouncer or another connection pooler, check its configuration. Restarting the pooler sometimes clears leaked connections faster than tracking down the leak source, buying you time to investigate.
Replication Lag
Replication lag is sneaky. Your writes succeed, your reads appear to work, but users see stale data. Lag spikes commonly come from:
- A long-running transaction on the primary that blocks replication
- A large bulk write (migration, backfill job)
- Network congestion between primary and replica
- The replica being under CPU pressure from replaying writes
If lag is growing and you're under time pressure, the practical options are:
- Route read traffic back to the primary temporarily. This increases primary load but ensures consistency.
- Pause the source of the large write if it's a batch job you control.
- Check for long-running transactions on the primary that may be holding a replication slot.
One thing to watch for: a stalled replication slot on an unresponsive replica can cause unbounded WAL accumulation on the primary. If disk usage on the primary is growing, check pg_replication_slots for any inactive slots.
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
If you find an inactive slot that's accumulating lag, dropping it is a judgment call. It frees disk space but means that replica will need to be re-synced.
When to Pull in More People
Not every database incident needs a DBA. But you should know when to escalate:
- Data corruption suspected: Stop, escalate immediately. Don't run repairs without someone who knows the storage engine.
- Schema migration gone wrong: Pull in whoever owns the migration. Rolling back a schema change is different from rolling back application code.
- Disk full: Infrastructure or platform team. This often needs volume expansion at the host level and might involve decisions about data retention.
- The incident spans multiple services: You probably need an incident commander at this point. Database failures tend to affect many services at once, and coordinating the response across teams is more than one person can manage.
Page someone early rather than late. A second set of eyes on a database incident is rarely wasted.
Communicating During a DB Incident
Database outages usually affect multiple services at once. Your status update should reflect that.
Don't write: "Database issue under investigation."
Write: "Postgres primary is experiencing high lock contention due to a long-running migration query. Write operations to [Service A] and [Service B] are failing. We've identified the blocking query (pid 4821, running 18 minutes) and are evaluating whether to terminate it. Estimated resolution: 20 minutes."
Specific is always better. Tell people which services are affected, what the root cause looks like, and what actions you're taking. Vague updates make people nervous and generate more noise for you.
After the Incident
Database incidents often reveal structural problems that were waiting to happen: missing indexes, queries never tested under production load, connection pool settings copied from defaults years ago.
For the postmortem, document:
- The specific query or process that caused the problem (include the actual SQL or job name)
- How you detected it (alert, user report, monitoring dashboard)
- What you tried before finding the fix, and how long each step took
- What would have made the investigation faster
The follow-up items worth putting in the backlog: enabling slow query logging if it's not already on, reviewing connection pool sizing against actual peak connections, and setting up replication lag alerts if you're running replicas without them.
The best database incidents are the ones where your runbook tells you exactly what to check. The second-best are the ones you write that runbook from.