NearIRM Team

Testing Your Alerts Before an Incident Proves They Don't Work

Most teams assume their monitoring works. They set up alerts, see them in the dashboard, and move on. Then six months later, an outage happens and the alert that should have fired at minute one fires at minute forty-three, after someone noticed the spike in customer support tickets.

Your alerting pipeline has a lot of places where things can fail silently: rules with bad logic, metrics that stop reporting, notification channels that have drifted, escalation policies pointing to people who left the company. None of these fail loudly. They fail quietly, and you find out during an incident.

Here's how to actually verify your monitoring will do what you think it will.

Start with dead man's switch alerts

A dead man's switch (sometimes called a heartbeat or watchdog alert) fires when nothing happens rather than when something goes wrong. Your system sends a heartbeat signal on a schedule, and if the alerting system doesn't receive it within a time window, it fires.

This catches a failure mode that threshold-based alerts can't: what happens when your metrics pipeline stops reporting entirely? If Prometheus stops scraping your targets, or your log shipper crashes, or your collector agent gets OOM-killed, your dashboards go empty and your alerts stay quiet, because silence looks identical to "everything is fine" to a threshold rule.

A dead man's switch inverts this. Silence becomes the signal.

In Prometheus, you can implement this by having each service publish a heartbeat timestamp on a schedule (Prometheus is pull-based, so a pushed heartbeat usually goes through the Pushgateway), then alerting when that timestamp gets too old:

- alert: HeartbeatMissing
  expr: time() - max by (job) (heartbeat_timestamp_seconds{job="api-service"}) > 300
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "No heartbeat from {{ $labels.job }} for 5 minutes"

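For Alertmanager itself, route an always-firing alert to an external dead man's switch service such as Dead Man's Snitch or Healthchecks.io. The Watchdog alert that ships with kube-prometheus-based deployments exists for exactly this; if your setup doesn't include it, a rule whose expression is vector(1) serves the same purpose. The external service pages you if it stops hearing from your system.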
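
The service side of the HeartbeatMissing rule above could look something like the sketch below. It assumes the heartbeat is delivered through a Pushgateway; the address, job name, and function names are illustrative and not part of the original setup. It also assumes the Pushgateway scrape job sets honor_labels: true so the pushed job label survives the scrape.

```python
import time
import urllib.request

# Assumed values for illustration; adjust for your environment.
PUSHGATEWAY_URL = "http://pushgateway:9091"
JOB_NAME = "api-service"


def build_heartbeat_payload(now: float) -> bytes:
    """Render the heartbeat gauge in the Prometheus text exposition format."""
    lines = [
        "# TYPE heartbeat_timestamp_seconds gauge",
        f"heartbeat_timestamp_seconds {int(now)}",
    ]
    # The text format expects the final line to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push_heartbeat(base_url: str = PUSHGATEWAY_URL, job: str = JOB_NAME) -> int:
    """PUT the current timestamp to the Pushgateway and return the HTTP status."""
    request = urllib.request.Request(
        url=f"{base_url}/metrics/job/{job}",
        data=build_heartbeat_payload(time.time()),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Intended usage: call push_heartbeat() from cron or a scheduler,
# for example once a minute.
```

Because the Pushgateway keeps serving the last pushed value, the rule compares the stored timestamp against time() rather than relying on the series disappearing.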

Test your alerting rules in CI

Prometheus alerting rules are just YAML with PromQL expressions. You can unit test them before they hit production.

The promtool test rules command runs rule tests against synthetic data:

# alert_tests.yaml
rule_files:
  - alerts.yaml

tests:
  - interval: 1m
    input_series:
      - series: 'http_request_errors_total{job="api"}'
        values: '0 0 0 10 20 30 40 50 60 70'
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api

Add this to your CI pipeline. It catches typos in metric names, wrong label selectors, and logic errors in thresholds before anything reaches production. It won't catch every class of problem, but it catches the obvious ones fast.
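
One way to wire this into CI is a small wrapper that finds every rule test file and runs promtool against it. This is a minimal sketch: the *_tests.yaml naming convention and the helper names are assumptions, and it expects promtool to be on the PATH.

```python
import subprocess
from pathlib import Path


def find_rule_tests(root: Path) -> list[Path]:
    """Return rule test files under root, sorted for stable CI output."""
    # Assumed convention: test files are named like alert_tests.yaml.
    return sorted(root.rglob("*_tests.yaml"))


def build_promtool_command(test_file: Path) -> list[str]:
    """Build the promtool invocation for a single rule test file."""
    return ["promtool", "test", "rules", str(test_file)]


def run_rule_tests(root: Path) -> int:
    """Run every rule test file and return how many of them failed."""
    failures = 0
    for test_file in find_rule_tests(root):
        result = subprocess.run(build_promtool_command(test_file))
        if result.returncode != 0:
            failures += 1
    return failures


# Intended usage in a CI step (the directory name is an assumption):
#   raise SystemExit(1 if run_rule_tests(Path("monitoring")) else 0)
```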

If you're managing your monitoring config with Terraform (for Grafana, Datadog, or similar), alert rule tests can run alongside the same plan/apply pipeline that validates your infrastructure changes.

Send real test alerts regularly

Alert rules passing unit tests doesn't mean the full pipeline works. The path from "rule fires" to "engineer gets paged" involves a lot of components: Alertmanager config, routing rules, your incident management platform, notification channels, phone numbers, and escalation policies. Any one of these can be misconfigured.

The fix is to send real alerts through the whole pipeline on a schedule.

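With Alertmanager, you can POST a test alert directly to the v2 API. The date -d syntax below is GNU date; on macOS or BSD, date -u -v+5M produces the same offset: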

curl -XPOST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical",
      "job": "test"
    },
    "annotations": {
      "summary": "Scheduled test alert - please acknowledge"
    },
    "endsAt": "'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"
  }]'

Run this weekly via a cron job, routed to your actual on-call rotation. The on-call engineer acknowledges it and moves on. If nobody acknowledges it, that's a signal something in the chain is broken.
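
If the cron host doesn't have GNU date, the same request can be sent from a short script. The sketch below mirrors the curl payload above; the function names are illustrative, and the Alertmanager address is the same assumed hostname used in the curl example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumed address, matching the curl example.
ALERTMANAGER_URL = "http://alertmanager:9093/api/v2/alerts"


def build_test_alert(now: datetime, ttl_minutes: int = 5) -> list[dict]:
    """Build the same test alert as the curl command, expiring after ttl_minutes."""
    ends_at = now.astimezone(timezone.utc) + timedelta(minutes=ttl_minutes)
    return [{
        "labels": {
            "alertname": "TestAlert",
            "severity": "critical",
            "job": "test",
        },
        "annotations": {
            "summary": "Scheduled test alert - please acknowledge",
        },
        "endsAt": ends_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }]


def send_test_alert(url: str = ALERTMANAGER_URL) -> int:
    """POST the test alert to Alertmanager and return the HTTP status."""
    body = json.dumps(build_test_alert(datetime.now(timezone.utc)))
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the payload builder separate from the HTTP call makes it easy to check the alert body in a unit test without a running Alertmanager.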

Verify your escalation policies point to real people

This sounds obvious, but it's one of the most common silent failures: an escalation policy that routes to a rotation including someone who left the company three months ago. Their phone number is deactivated. The alert fires, they don't pick up, and the escalation chain hits a dead end.

Go through your escalation policies periodically and confirm that every person in every rotation has a valid, current phone number configured, has push notifications enabled if you're using app-based alerting, and actually knows they're in the rotation.

Automate this check if you can. Most incident management platforms have an API. A script that lists all schedules, pulls the current on-call person for each, and verifies they have at least one working notification method configured is worth writing once and running monthly.
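
The core of such a script might look like the sketch below. The data shapes are hypothetical: every incident management platform exposes schedules and contact methods differently, so the fetching step is left out and only the check itself is shown.

```python
from dataclasses import dataclass, field


@dataclass
class ContactMethod:
    kind: str       # e.g. "sms", "voice", "push" (hypothetical values)
    verified: bool  # whether the platform reports the method as verified


@dataclass
class OnCallEntry:
    schedule: str   # schedule or rotation name
    user: str       # person currently on call for that schedule
    active: bool    # False if the account has been deactivated
    contact_methods: list[ContactMethod] = field(default_factory=list)


def find_unreachable(entries: list[OnCallEntry]) -> list[str]:
    """Return a readable problem line for each on-call entry that cannot be paged."""
    problems = []
    for entry in entries:
        if not entry.active:
            problems.append(f"{entry.schedule}: {entry.user} is deactivated")
        elif not any(m.verified for m in entry.contact_methods):
            problems.append(
                f"{entry.schedule}: {entry.user} has no verified contact method"
            )
    return problems
```

In practice the entries would be built from your platform's API responses, and a non-empty result would open a ticket or post to a team channel.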

Monitor the monitoring itself

Your monitoring stack can fail. Prometheus goes down. Your metrics collector runs out of memory and gets killed. Your log shipping pipeline backs up. These failures don't just reduce visibility; they stop your alerts from working at all.

Set up external checks that verify your monitoring is alive:

Synthetic uptime checks on Prometheus itself. Hit the /metrics or /-/healthy endpoint from an external monitoring service. If it returns an error, your internal Prometheus alerts won't fire either.

Alert on absent metrics. Prometheus's absent() function fires when expected metrics stop appearing:

absent(up{job="api-service"}) == 1

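If service discovery breaks and every target for the job disappears from Prometheus, this fires even though nothing else does. Targets that are still discovered but failing their scrapes report up == 0 rather than going absent, so pair this with a rule on up == 0 to cover the case where all your targets go down at once.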

Track your alert volume. If your systems normally generate 10-20 alerts per day and suddenly that number drops to zero, something has probably broken rather than everything simultaneously getting better. A simple check on alert count over time can catch this.
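
The alert volume check can be as simple as comparing today's count against a recent baseline. The sketch below assumes you already have daily alert counts from somewhere (your incident platform's API, or a query over Prometheus's ALERTS series); the function name and the min_baseline threshold are illustrative.

```python
from statistics import median


def alert_volume_looks_broken(
    daily_counts: list[int], today: int, min_baseline: int = 5
) -> bool:
    """Flag a day with zero alerts when the recent baseline is clearly non-zero."""
    if not daily_counts:
        return False  # no history, nothing to compare against
    baseline = median(daily_counts)
    # Zero alerts is only suspicious if a normal day produces a fair number.
    return today == 0 and baseline >= min_baseline
```

Using the median keeps one unusually noisy day from skewing the baseline, and the threshold keeps the check quiet for systems that normally alert rarely.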

Build a validation checklist

Not everything needs to be automated. Run through this quarterly:

| Check | Method |
| --- | --- |
| Heartbeat alerts are configured and routing correctly | Review Alertmanager routes |
| Alert rules have unit tests in CI | Check CI pipeline |
| Full-pipeline test alert sent | Weekly cron, verify acknowledgement |
| Escalation policies have valid, current contacts | Quarterly manual review |
| Prometheus and collector uptime monitored externally | Synthetic check |
| Alert delivery latency tracked per incident | Review recent incident timelines |

The last item matters more than most teams realize. For every real incident, track the gap between "issue started" (from logs or metrics, not when someone noticed) and "engineer acknowledged the alert." If that gap is growing, or consistently above your target response time, your alerting needs work.
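
Tracking that gap takes very little code once the two timestamps are recorded per incident. The sketch below assumes each incident is a pair of (issue started, alert acknowledged) datetimes pulled from your incident records; the function names and the target value are illustrative.

```python
from datetime import datetime, timedelta
from statistics import median

Incident = tuple[datetime, datetime]  # (issue_started, alert_acknowledged)


def detection_gaps(incidents: list[Incident]) -> list[timedelta]:
    """Return the start-to-acknowledgement gap for each incident."""
    return [acknowledged - started for started, acknowledged in incidents]


def gaps_over_target(incidents: list[Incident], target: timedelta) -> list[timedelta]:
    """Return only the gaps that exceeded the target response time."""
    return [gap for gap in detection_gaps(incidents) if gap > target]


def median_gap(incidents: list[Incident]) -> timedelta:
    """Return the median gap, useful for spotting a trend between reviews."""
    return median(detection_gaps(incidents))
```

Comparing the median from one quarterly review to the next shows whether the gap is growing, and the list of gaps over target points at the incidents worth a closer look.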

What good looks like

A well-tested alerting pipeline feels boring. You get a test alert on schedule, you acknowledge it, it resolves. Your CI runs the rule tests and they pass. Quarterly review finds a couple of stale contacts you clean up.

The teams that handle incidents well aren't always the ones with the most sophisticated monitoring setups. They're the ones who trust their monitoring, because they've tested it enough to know it works. That confidence matters a lot when something breaks at 2am and you need to know whether the silence means "everything is fine" or "the alerting pipeline is broken too."

Testing your monitoring is unglamorous plumbing work. But it's a lot better than discovering the gaps when you need the system most.
