Alert Correlation: How to Group Related Alerts and Cut the Noise

When a database goes down, your monitoring stack doesn't page you once. It pages you for high query latency, connection pool exhaustion, failed health checks, a spiking error rate in the API that queries it, and a separate SLO burn rate alert that fired thirty seconds later. That's five pages for one root cause.

Alert correlation is the practice of recognizing that those five events belong together and surfacing them as a single incident rather than five separate interruptions. Getting it right makes a meaningful difference to on-call quality of life and response time.

Deduplication is not correlation

These two terms get used interchangeably, but they solve different problems.

Deduplication suppresses repeat firings of the same alert. If your CPU alert fires every minute while the condition holds, deduplication ensures your on-call engineer gets one page, not sixty. Most alerting systems do this automatically.

Correlation is the harder problem: recognizing that different alerts from different sources are symptoms of the same underlying issue. A CPU alert and a request timeout alert are separate signals. Correlation connects them.

You can have perfect deduplication and still flood your team with correlated noise. Many teams stop at deduplication and never tackle the second problem.
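To make the distinction concrete, here is a minimal sketch of deduplication. It assumes alerts are plain dicts with a name and a label set; that shape and the function names are illustrative, not any particular tool's API.

```python
def fingerprint(alert):
    """Identity of an alert: its name plus its sorted label set."""
    return (alert["name"], tuple(sorted(alert["labels"].items())))


def deduplicate(alerts):
    """Keep the first firing of each distinct alert and drop repeats."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert), alert)
    return list(seen.values())


firings = [
    {"name": "HighCPU", "labels": {"host": "db-1"}},
    {"name": "HighCPU", "labels": {"host": "db-1"}},  # repeat firing
    {"name": "RequestTimeout", "labels": {"service": "api"}},
]
unique = deduplicate(firings)
print([a["name"] for a in unique])
```

Two alerts survive deduplication. Nothing in this logic knows whether HighCPU and RequestTimeout share a cause; that is the part correlation has to supply.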

Three approaches to correlation

1. Time-window grouping

The simplest form: any alerts firing within a short time window get grouped into one incident.

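# Example (Alertmanager): batch alerts from the same cluster into one notification
route:
  receiver: on-call        # hypothetical receiver name
  group_by:
    - cluster              # grouping by alertname would keep different alerts apart
  group_wait: 30s          # buffer before the first notification for a new group
  group_interval: 5m       # wait before notifying about alerts added to the group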

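Alertmanager approximates this with group_by, group_wait, and group_interval: alerts that share the group_by labels are batched into one notification, group_wait sets how long to buffer before the first notification goes out, and group_interval sets how long to wait before notifying about alerts added to an existing group. The downside is that it groups by proximity, not by cause. An unrelated alert that fires at the same time can end up bundled with your actual incident.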

Time-window grouping is easy to set up and catches a lot of correlated noise. It's a reasonable starting point for most teams.
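The idea itself fits in a few lines. The sketch below is not how Alertmanager implements grouping; it is a simplified illustration that assumes each alert is a dict with a fired_at timestamp, and it puts an alert in the current group if it fired within the window of the previous alert.

```python
from datetime import datetime, timedelta


def group_by_window(alerts, window=timedelta(minutes=5)):
    """Chain alerts into groups: an alert joins the current group if it
    fired within `window` of the previous alert, else it starts a new one."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["fired_at"]):
        if groups and alert["fired_at"] - groups[-1][-1]["fired_at"] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    {"name": "DatabaseHighLatency", "fired_at": t0},
    {"name": "PaymentAPIErrorRate", "fired_at": t0 + timedelta(minutes=1)},
    {"name": "CertExpiringSoon", "fired_at": t0 + timedelta(minutes=3)},  # unrelated
    {"name": "DiskFull", "fired_at": t0 + timedelta(minutes=30)},
]
incidents = group_by_window(alerts)
for group in incidents:
    print([a["name"] for a in group])
```

The unrelated CertExpiringSoon alert lands in the same group as the database incident purely because of timing, which is the proximity-versus-cause weakness described above.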

2. Topology-based correlation

Here you model the relationships between your services, and when alerts fire on dependent services, you correlate them upward to the root cause.

If your payment service depends on your database, and both start alerting, topology-based correlation tells you the payment service alert is likely a downstream effect. The incident gets opened on the database; the payment service alert gets attached as a symptom.

This requires maintaining a service dependency map, which is its own project. If you already have a service catalog, you're partway there. If not, start with the five or ten most-paged services and build the map incrementally.

The benefit over time-window grouping is precision. You're not just grouping things that fired around the same time; you're grouping things that are structurally related.
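As a rough sketch of the mechanics, assume the dependency map is a dict from each service to the services it calls directly, and that the graph has no cycles. The service names and the correlate function are hypothetical.

```python
# Hypothetical dependency map: service -> services it depends on directly.
DEPENDS_ON = {
    "payment-api": ["database"],
    "user-service": ["database"],
    "database": [],
}


def upstream(service, deps):
    """All transitive dependencies of `service`."""
    seen, stack = set(), list(deps.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(deps.get(dep, []))
    return seen


def correlate(alerting, deps):
    """Map each root-cause service to the alerting services downstream of it.
    A service counts as a root if none of its dependencies is alerting."""
    alerting = set(alerting)
    roots = {s for s in alerting if not (upstream(s, deps) & alerting)}
    incidents = {root: [] for root in sorted(roots)}
    for svc in sorted(alerting - roots):
        for root in sorted(upstream(svc, deps) & roots):
            incidents[root].append(svc)
    return incidents


incidents = correlate(["payment-api", "database", "user-service"], DEPENDS_ON)
print(incidents)
```

Here the incident opens on database, with payment-api and user-service attached as symptoms. If only payment-api were alerting, it would be treated as its own root.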

3. Label-based grouping

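Alertmanager implements this with inhibition rules. Strictly speaking, this is suppression rather than grouping: while a source alert is firing, notifications for matching target alerts are muted, provided both alerts carry the same values for the labels listed under equal. In the example below, DatabaseDown mutes payment-api alerts in the same cluster and region.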

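inhibit_rules:
  # While DatabaseDown is firing, mute payment-api alerts
  # that share its cluster and region.
  # (source_matchers/target_matchers replace the deprecated
  # source_match/target_match fields in newer Alertmanager releases.)
  - source_matchers:
      - alertname="DatabaseDown"
    target_matchers:
      - service="payment-api"
    equal:
      - cluster
      - region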

Label-based grouping is precise but requires upfront investment in consistent labeling across your alert definitions. If half your alerts don't have a service label, this won't work well. A labeling audit pays off before you try to build correlation on top of it.
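The matching logic behind a rule like this can be approximated in a few lines. This is a simplified model with exact-match labels only, not Alertmanager's actual implementation, and the rule dict layout is an assumption made for the example.

```python
def matches(labels, matchers):
    """True if every matcher label is present with the expected value."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if `target` matches the rule's target matchers and some firing
    alert matches the source matchers and agrees on every `equal` label."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(label) == target.get(label) for label in rule["equal"])
        for source in firing
    )


rule = {
    "source": {"alertname": "DatabaseDown"},
    "target": {"service": "payment-api"},
    "equal": ["cluster", "region"],
}
db_down = {"alertname": "DatabaseDown", "cluster": "prod-1", "region": "eu-west"}
same_cluster = {"alertname": "PaymentAPIErrorRate", "service": "payment-api",
                "cluster": "prod-1", "region": "eu-west"}
other_cluster = dict(same_cluster, cluster="prod-2")
unlabeled = {"alertname": "PaymentAPIErrorRate",
             "cluster": "prod-1", "region": "eu-west"}  # no service label

print(is_inhibited(same_cluster, [db_down], rule))
print(is_inhibited(other_cluster, [db_down], rule))
print(is_inhibited(unlabeled, [db_down], rule))
```

Only the first alert is muted. The third shows the labeling problem directly: without a service label, the alert never matches the target side of the rule, so it still pages.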

Flapping alerts

Flapping is when an alert repeatedly fires and resolves, fires and resolves, because a condition is bouncing around a threshold. This creates a stream of "firing" and "resolved" notifications that accomplishes nothing except interrupting your team.

The fix is alert evaluation windows. Instead of alerting the moment a condition is met, require the condition to hold for a minimum duration:

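# Prometheus: only fire if the condition holds for 10 minutes
- alert: HighErrorRate
  # rate() returns a per-second average, so this means
  # more than 0.05 errors per second over the last 5 minutes
  expr: rate(http_errors_total[5m]) > 0.05
  for: 10m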

The for duration means transient spikes don't trigger pages. You're alerting on sustained problems, not momentary blips.
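The effect of a hold duration can be illustrated with a simplified model. The sketch below counts consecutive evaluations instead of tracking wall-clock time the way Prometheus does, and the function name and sample values are made up for the example.

```python
def evaluate_with_hold(samples, threshold, hold_evals):
    """Return the firing state after each evaluation. The alert fires only
    once the condition has held for `hold_evals` consecutive evaluations."""
    states, streak = [], 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        states.append(streak >= hold_evals)
    return states


# Error rate per evaluation: one brief spike, then a sustained problem.
samples = [0.06, 0.02, 0.06, 0.07, 0.08, 0.09, 0.01]
states = evaluate_with_hold(samples, threshold=0.05, hold_evals=3)
print(states)
```

The isolated spike in the first sample never fires. The alert only turns on at the fifth sample, after three consecutive breaches, and clears as soon as the condition stops holding.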

For conditions that naturally fluctuate (memory usage, connection counts), add hysteresis: fire at 90%, resolve only when it drops below 80%. This prevents the alert from toggling at exactly the threshold.
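A minimal sketch of hysteresis, assuming a percentage metric and the 90/80 thresholds from the paragraph above. The class name is hypothetical, and this models the state logic only, not any specific alerting tool.

```python
class HysteresisAlert:
    """Fires at or above `fire_at`; resolves only below `resolve_below`."""

    def __init__(self, fire_at=90.0, resolve_below=80.0):
        self.fire_at = fire_at
        self.resolve_below = resolve_below
        self.firing = False

    def update(self, value):
        """Feed one sample and return the resulting firing state."""
        if not self.firing and value >= self.fire_at:
            self.firing = True
        elif self.firing and value < self.resolve_below:
            self.firing = False
        return self.firing


alert = HysteresisAlert()
samples = [85, 91, 89, 92, 88, 79, 85]
states = [alert.update(v) for v in samples]
naive = [v >= 90 for v in samples]  # single threshold, for comparison
print(states)
print(naive)
```

With a single threshold at 90, these samples fire and resolve twice. With hysteresis, the alert fires once at 91, stays on through the dips to 89 and 88, and resolves once at 79.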

Flapping often reveals that a threshold is set too close to normal operating range. If an alert fires and resolves multiple times per day, the threshold probably needs adjustment, not just a suppression rule.

Practical setup: where to start

If you're starting from scratch, here's a sequence that works:

Week 1: Audit your alert volume. Across the past month of on-call rotations, count how many alerts fired per incident. If the average is above 2:1 (more than two alerts per actual incident), you have a correlation problem worth solving.

Week 2: Add consistent labels. Every alert should have at minimum: service, team, severity, and environment. This is the foundation everything else builds on.

Week 3: Enable time-window grouping. In Alertmanager or your alerting platform, configure a 5-minute grouping window. This alone typically cuts paging volume by 30-50% for teams with dense alert configs.

Month 2: Map your top 10 services. Pick the services that generate the most alerts and document their direct dependencies. Use this to write your first inhibition rules.

Ongoing: Review after each major incident. Ask whether the alerts you received during the incident were all necessary, or whether some were symptoms that could have been suppressed once the root cause was identified.
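The Week 1 and Week 2 audits above can be scripted against whatever export your alerting platform provides. The sketch below assumes two hypothetical inputs: a log of (alert name, incident ID) pairs and a list of alert rules with their labels. Both shapes are illustrative.

```python
REQUIRED_LABELS = ("service", "team", "severity", "environment")


def alerts_per_incident(alert_log):
    """Average number of alerts per incident.
    `alert_log` is a list of (alert_name, incident_id) pairs."""
    incidents = {incident_id for _, incident_id in alert_log}
    return len(alert_log) / len(incidents) if incidents else 0.0


def missing_labels(alert_rules):
    """Map each rule name to the required labels it lacks."""
    report = {}
    for rule in alert_rules:
        missing = [l for l in REQUIRED_LABELS if l not in rule["labels"]]
        if missing:
            report[rule["name"]] = missing
    return report


alert_log = [
    ("DatabaseHighLatency", "INC-1"),
    ("PaymentAPIErrorRate", "INC-1"),
    ("UserServiceTimeout", "INC-1"),
    ("SLOBurnRateHigh", "INC-1"),
    ("HealthCheckFailed", "INC-1"),
    ("DiskFull", "INC-2"),
]
alert_rules = [
    {"name": "DatabaseHighLatency",
     "labels": {"service": "database", "team": "data",
                "severity": "critical", "environment": "prod"}},
    {"name": "PaymentAPIErrorRate",
     "labels": {"service": "payment-api", "severity": "critical"}},
]

ratio = alerts_per_incident(alert_log)
gaps = missing_labels(alert_rules)
print(ratio)
print(gaps)
```

In this made-up data, six alerts across two incidents gives a ratio of 3.0, above the 2:1 threshold, and the label report points at the rule that needs team and environment labels before inhibition rules can rely on it.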

What good correlation looks like in practice

Here's a before/after for a typical database incident:

Before correlation	After correlation
Page: `DatabaseHighLatency`	Page: `DatabaseHighLatency` (primary)
Page: `PaymentAPIErrorRate`	Attached: `PaymentAPIErrorRate` (symptom)
Page: `UserServiceTimeout`	Attached: `UserServiceTimeout` (symptom)
Page: `SLOBurnRateHigh`	Attached: `SLOBurnRateHigh` (symptom)
Page: `HealthCheckFailed`	Attached: `HealthCheckFailed` (symptom)

The on-call engineer gets one page with four attached context signals. They know the database is the likely root cause before they even open a terminal.

The response is faster because the context is pre-assembled. The engineer doesn't need to manually determine whether five separate alerts are related; that work happened automatically.

The limits of automated correlation

Correlation systems work well for incidents that match known patterns: a database goes down, dependent services follow. They work less well for novel failure modes or incidents that span multiple unrelated services simultaneously.

Don't over-invest in automated correlation at the expense of good runbooks and fast escalation paths. A well-labeled alert with a clear runbook link gets an engineer to the root cause faster than a sophisticated correlation system with poor documentation.

Build correlation to improve the signal-to-noise ratio. Build runbooks and escalation policies to make sure that once the right signal surfaces, someone can act on it quickly.

The goal isn't zero pages. It's pages that mean something.