NearIRM Team
NearIRM Team6 min read

Responding to Database Incidents: A Practical Guide

Database incidents escalate fast. A slow query becomes a connection pool issue becomes a cascading failure across every service that depends on that database. The blast radius is wide, the pressure is high, and the wrong move at the wrong moment can make things significantly worse.

This is a guide for the engineer who just got paged for something database-related.

Before You Do Anything

Your first job is to understand the scope, not fix it. Resist the urge to restart a service or kill connections before you know what you're looking at.

Start with three questions:

  • Is the database accepting connections? A complete outage is a different situation than a degraded one.
  • Which services are affected? Check your service dependency map or run a quick query on your monitoring tool.
  • When did it start? Cross-reference with recent deploys, cron jobs, or traffic spikes.

Only after you have answers should you start taking action.

Common Failure Modes and Their Signatures

Different symptoms point to different root causes. Here's a quick reference:

SymptomLikely CauseFirst Check
Queries timing outLong-running query, lock contentionSHOW PROCESSLIST (MySQL) or pg_stat_activity (Postgres)
High connection countConnection pool exhaustionActive connections vs. configured max
Slow reads, writes normalReplication lagReplica lag metrics, seconds_behind_master
Intermittent query slowdownsCache eviction, missing index, plan regressionSlow query log, EXPLAIN on slow queries
Disk I/O spikesLarge batch job, full table scan, WAL writesDisk metrics, check for sequential scans
Database won't startCorrupted data file, out of disk spaceDisk usage, error logs

Read-Only Investigation First

If you're on Postgres, these queries are your starting point. All of them are read-only.

-- Queries running longer than 5 seconds
SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 seconds'
  AND state != 'idle'
ORDER BY duration DESC;

-- Find blocked queries and what's blocking them
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

-- Replication lag (run on the primary)
SELECT
  client_addr,
  state,
  (sent_lsn - replay_lsn) AS replication_delay_bytes
FROM pg_stat_replication;

For MySQL, the equivalent starting points are SHOW PROCESSLIST;, SHOW ENGINE INNODB STATUS;, and checking Seconds_Behind_Master on replicas.

When to Kill a Query

Killing a long-running query is sometimes the right call, but not always. Before you do it:

  • Confirm the query is actually causing the problem, not just a symptom.
  • Check whether it's a write. Killing a long UPDATE or DELETE triggers a rollback, which can take as long as the transaction ran.
  • Check who owns it. A long-running query from your analytics pipeline is different from one holding a lock that's blocking all application writes.

If you decide to terminate in Postgres:

-- Try this first: sends an interrupt signal
SELECT pg_cancel_backend(<pid>);

-- If that doesn't work: force-terminates
SELECT pg_terminate_backend(<pid>);

Connection Pool Exhaustion

When your app can't connect to the database, the most common cause isn't the database being down. It's running out of connections.

Check where you stand:

-- Postgres
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;

-- See what's holding connections (look for 'idle' with long connection durations)
SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;

If you're at or near the limit, look for idle connections that have been open a long time without doing work. These are usually connection leaks in the application layer.

Short-term mitigation: if you run PgBouncer or another connection pooler, check its configuration. Restarting the pooler sometimes clears leaked connections faster than tracking down the leak source, buying you time to investigate.

Replication Lag

Replication lag is sneaky. Your writes succeed, your reads appear to work, but users see stale data. Lag spikes commonly come from:

  • A long-running transaction on the primary that blocks replication
  • A large bulk write (migration, backfill job)
  • Network congestion between primary and replica
  • The replica being under CPU pressure from replaying writes

If lag is growing and you're under time pressure, the practical options are:

  1. Route read traffic back to the primary temporarily. This increases primary load but ensures consistency.
  2. Pause the source of the large write if it's a batch job you control.
  3. Check for long-running transactions on the primary that may be holding a replication slot.

One thing to watch for: a stalled replication slot on an unresponsive replica can cause unbounded WAL accumulation on the primary. If disk usage on the primary is growing, check pg_replication_slots for any inactive slots.

SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;

If you find an inactive slot that's accumulating lag, dropping it is a judgment call. It frees disk space but means that replica will need to be re-synced.

When to Pull in More People

Not every database incident needs a DBA. But you should know when to escalate:

  • Data corruption suspected: Stop, escalate immediately. Don't run repairs without someone who knows the storage engine.
  • Schema migration gone wrong: Pull in whoever owns the migration. Rolling back a schema change is different from rolling back application code.
  • Disk full: Infrastructure or platform team. This often needs volume expansion at the host level and might involve decisions about data retention.
  • The incident spans multiple services: You probably need an incident commander at this point. Database failures tend to affect many services at once, and coordinating the response across teams is more than one person can manage.

Page someone early rather than late. A second set of eyes on a database incident is rarely wasted.

Communicating During a DB Incident

Database outages usually affect multiple services at once. Your status update should reflect that.

Don't write: "Database issue under investigation."

Write: "Postgres primary is experiencing high lock contention due to a long-running migration query. Write operations to [Service A] and [Service B] are failing. We've identified the blocking query (pid 4821, running 18 minutes) and are evaluating whether to terminate it. Estimated resolution: 20 minutes."

Specific is always better. Tell people which services are affected, what the root cause looks like, and what actions you're taking. Vague updates make people nervous and generate more noise for you.

After the Incident

Database incidents often reveal structural problems that were waiting to happen: missing indexes, queries never tested under production load, connection pool settings copied from defaults years ago.

For the postmortem, document:

  • The specific query or process that caused the problem (include the actual SQL or job name)
  • How you detected it (alert, user report, monitoring dashboard)
  • What you tried before finding the fix, and how long each step took
  • What would have made the investigation faster

The follow-up items worth putting in the backlog: enabling slow query logging if it's not already on, reviewing connection pool sizing against actual peak connections, and setting up replication lag alerts if you're running replicas without them.

The best database incidents are the ones where your runbook tells you exactly what to check. The second-best are the ones you write that runbook from.

Related Posts