NearIRM Team

How to Write Alert Notifications That Speed Up Triage

Your alert fires at 2:47 AM. The on-call engineer unlocks their phone, squints at the notification, and reads:

CRITICAL: high_error_rate Threshold exceeded

That's it. No service name, no affected region, no link to a dashboard, no hint of what "high" means or what "error rate" is being measured. The engineer spends the next four minutes just figuring out what they're looking at before they can even start fixing it.

This is the most common failure mode in on-call programs, and it has nothing to do with alert volume. It's a writing problem.

What most alert notifications get wrong

Alert conditions are usually well-thought-out. Someone spent time picking the right metric, setting a reasonable threshold, and wiring it to the right team. But the notification body, the thing a half-asleep engineer actually reads, gets maybe 30 seconds of attention.

The result is alerts that:

  • Use the metric name as the title without explaining what it means operationally
  • Include a number with no context ("latency_p99 = 2847ms" — is that bad? compared to what?)
  • Don't link to anything useful
  • Omit which environment, region, or customer is affected
  • Repeat the title in the body word-for-word

Each of these forces the engineer to context-switch before they can act. They open a browser, find the right Grafana dashboard, navigate to the right service, figure out the timeframe. By the time they're oriented, five to ten minutes have passed.
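Several of the failure modes above can be caught mechanically before an alert ever pages anyone. The following is a minimal sketch, not a complete linter: the function name `lint_notification` and the specific heuristics (a snake_case title, a digit with no mention of "threshold" or "SLO") are assumptions made for illustration, and it does not try to detect a missing environment or region.

```python
import re


def lint_notification(title: str, body: str) -> list[str]:
    """Return a list of problems found in an alert notification.

    Heuristics only: each check mirrors one failure mode from the list above.
    """
    problems = []

    # A title made only of lowercase letters, digits, and underscores is
    # almost certainly a raw metric or rule name.
    if re.fullmatch(r"[a-z0-9_]+", title.strip()):
        problems.append("title is a raw metric name")

    # A body that repeats the title word-for-word adds nothing.
    if body.strip().lower() == title.strip().lower():
        problems.append("body repeats the title")

    # No dashboard, runbook, or deploy link at all.
    if not re.search(r"https?://", body):
        problems.append("no link in body")

    # A number with no reference point (no threshold or SLO mentioned).
    if re.search(r"\d", body) and not re.search(r"threshold|slo", body, re.IGNORECASE):
        problems.append("value shown without a threshold")

    return problems


if __name__ == "__main__":
    for problem in lint_notification(
        "checkout_service_latency_p99_alert",
        "checkout_service_latency_p99 = 4248",
    ):
        print(problem)
```

Run against a bare metric-style alert, this reports a raw metric title, no link, and a value without a threshold. A check like this could run wherever alert definitions are reviewed.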

The anatomy of a useful alert notification

Think of an alert notification as a first responder briefing. It should answer: what broke, how bad is it, who's affected, and where do I look first?

A title that describes the symptom, not the metric

Bad: api_request_duration_seconds_p99 > 2
Good: API latency above SLO — checkout service

The title is often the only thing visible in a phone notification preview. Make it tell the story in eight words or fewer. Use the service name and a human-readable description of the problem.
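If titles are generated rather than typed by hand, the word limit can be enforced at the point where the title is built. This is a small sketch under assumptions: `build_title` is a hypothetical helper, and the colon separator is just one choice of delimiter.

```python
def build_title(symptom: str, service: str, max_words: int = 8) -> str:
    """Combine a human-readable symptom and a service name into a short title."""
    title = f"{symptom}: {service}"
    word_count = len(title.split())
    if word_count > max_words:
        # Fail loudly so an over-long title is fixed when the alert is written,
        # not discovered on a lock screen at 3 AM.
        raise ValueError(f"title has {word_count} words, limit is {max_words}")
    return title


if __name__ == "__main__":
    print(build_title("API latency above SLO", "checkout service"))
```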

Severity with context

Don't just say CRITICAL. Include why:

Severity: CRITICAL
Reason: P99 latency at 4.2s (SLO threshold: 1s) — burning error budget at 12x normal rate

A number without a reference point means nothing at 3 AM. Always show the threshold alongside the current value.
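Pairing the value with its threshold is easy to automate. The sketch below assumes a hypothetical helper, `severity_reason`, that receives the current value and the threshold in the same unit. Note that the ratio it prints is simply value divided by threshold; it is not an error budget burn rate, which would need SLO window data this example does not have.

```python
def severity_reason(metric: str, current: float, threshold: float, unit: str) -> str:
    """Format a 'Reason' line that always shows the value next to its threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive to compute a ratio")
    ratio = current / threshold
    return (
        f"Reason: {metric} at {current:g}{unit} "
        f"(SLO threshold: {threshold:g}{unit}), {ratio:.1f}x over threshold"
    )


if __name__ == "__main__":
    print(severity_reason("P99 latency", 4.2, 1.0, "s"))
    # prints: Reason: P99 latency at 4.2s (SLO threshold: 1s), 4.2x over threshold
```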

Affected scope

The fastest path to diagnosis often starts with "is this everywhere or one region?" Put that in the notification:

Affected: us-east-1 (eu-west-1 nominal)

If you know which customers or which percentage of traffic is affected, include it. That single line changes the urgency calculus entirely.
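One way to produce that scope line is to evaluate the alert condition per region and summarize the result. This is a sketch that assumes you already have a per-region firing status available; the `affected_line` helper and its input shape are hypothetical.

```python
def affected_line(region_status: dict[str, bool]) -> str:
    """Summarize scope. region_status maps region name to True if the alert fires there."""
    affected = sorted(r for r, firing in region_status.items() if firing)
    nominal = sorted(r for r, firing in region_status.items() if not firing)

    if not affected:
        return "Affected: none"

    line = "Affected: " + ", ".join(affected)
    if nominal:
        # Naming the healthy regions answers "is this everywhere?" directly.
        line += " (" + ", ".join(nominal) + " nominal)"
    else:
        line += " (all regions)"
    return line


if __name__ == "__main__":
    print(affected_line({"us-east-1": True, "eu-west-1": False}))
    # prints: Affected: us-east-1 (eu-west-1 nominal)
```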

One-click access to context

Include links directly in the notification body. At minimum:

  • Dashboard scoped to the alert's timeframe
  • Runbook for this alert
  • Recent deployments (or a link to your deploy log)

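Most alerting tools support this with template variables. A link like https://grafana.internal/d/checkout?from=now-30m that opens pre-scoped to the last 30 minutes saves a couple of minutes of navigation every single time. Keep in mind that a relative range such as now-30m is evaluated when the link is opened, so if the page may be read much later, absolute from and to timestamps keep the view pinned to when the alert fired.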
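Here is a sketch of building a dashboard link pinned to the alert's own timeframe. It relies on Grafana accepting `from` and `to` query parameters as epoch milliseconds; the `dashboard_link` helper, the 30-minute lookback, and the base URL are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def dashboard_link(base_url: str, fired_at: datetime, lookback_minutes: int = 30) -> str:
    """Build a dashboard URL covering the window that ends when the alert fired."""
    start = fired_at - timedelta(minutes=lookback_minutes)
    params = {
        # Absolute epoch milliseconds, so the window does not drift
        # if the link is opened long after the page went out.
        "from": int(start.timestamp() * 1000),
        "to": int(fired_at.timestamp() * 1000),
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
    print(dashboard_link("https://grafana.internal/d/checkout", fired))
```

The trade-off is that an absolute window will not show what happened after the alert fired; some teams extend `to` by a few minutes or include both a pinned and a live link.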

Recent changes

A large fraction of incidents are caused by something that just changed: a deploy, a config push, a feature flag rollout. If your alerting system can pull in recent deploys for the affected service, do it. Even a simple "last deploy: 14 minutes ago (v2.4.1 by @jsmith)" gives the engineer a strong lead.
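A line like that can be generated from whatever deploy record you have access to. The sketch below assumes the deploy timestamp, version, and author are already known; `last_deploy_line` and the two-hour cutoff are hypothetical choices, and fetching the record from a deploy system is left out.

```python
from datetime import datetime, timezone
from typing import Optional


def last_deploy_line(
    deployed_at: datetime,
    now: datetime,
    version: str,
    author: str,
    max_age_minutes: int = 120,
) -> Optional[str]:
    """Return a 'Last deploy' line, or None if the deploy is too old to be a useful lead."""
    age_minutes = int((now - deployed_at).total_seconds() // 60)
    if age_minutes < 0 or age_minutes > max_age_minutes:
        # An old deploy is more likely noise than a lead, so omit the line.
        return None
    return f"Last deploy: {age_minutes} min ago ({version} by @{author})"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
    deployed = datetime(2024, 1, 1, 2, 33, tzinfo=timezone.utc)
    print(last_deploy_line(deployed, now, "v2.4.1", "jsmith"))
    # prints: Last deploy: 14 min ago (v2.4.1 by @jsmith)
```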

A before and after

Here's the same alert rewritten.

Before:

[CRITICAL] checkout_service_latency_p99_alert

checkout_service_latency_p99 = 4248

After:

[CRITICAL] Checkout service latency above SLO — us-east-1

P99 latency: 4,248ms (SLO: 1,000ms)
Affected: us-east-1 | eu-west-1 nominal
Error budget burn rate: 12x

Last deploy: 18 min ago — v2.4.1 (feature: new payment provider)

Dashboard: https://grafana.internal/d/checkout?from=now-30m
Runbook: https://wiki.internal/runbooks/checkout-latency

The second version takes maybe 30 extra seconds to read, but it replaces five minutes of orienting work with immediate action. The engineer sees the deploy timestamp and has a hypothesis right away.

Templates to enforce consistency

The problem with good alert bodies is that they require whoever writes the alert to think carefully about what context to include. That doesn't scale.

Instead, build a standard template in your alerting system and require it for all production alerts. Most tools (PagerDuty, NearIRM, Grafana OnCall) support notification templates with variable interpolation. The exact field names vary by tool, but a template might look like:

{{ .AlertName }} — {{ .Labels.service }}

Current: {{ .Value }}{{ .Labels.unit }}  |  Threshold: {{ .Annotations.threshold }}
Affected: {{ .Labels.region }}
Severity: {{ .Labels.severity }}

{{ if .Annotations.last_deploy }}Last deploy: {{ .Annotations.last_deploy }}{{ end }}

Dashboard: {{ .Annotations.dashboard_url }}
Runbook: {{ .Annotations.runbook_url }}

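Because the template reads runbook_url and dashboard_url from alert annotations, a new alert that omits them renders with an obviously blank or broken field, which effectively makes them required when adding a new alert. In practice, teams that hold to this tend to spend less time in triage and more time actually fixing things.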
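To turn "required" into something enforceable, a check can reject alert definitions that lack the annotations the template depends on. This is a sketch under assumptions: alert rules are represented as plain dictionaries with `name` and `annotations` keys, which may not match how your tool stores them, and `missing_annotations` is a hypothetical helper.

```python
REQUIRED_ANNOTATIONS = ("threshold", "dashboard_url", "runbook_url")


def missing_annotations(alert_rules: list[dict]) -> dict[str, list[str]]:
    """Map each alert name to the required annotations it lacks (empty values count as missing)."""
    missing = {}
    for rule in alert_rules:
        annotations = rule.get("annotations", {})
        absent = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if absent:
            missing[rule.get("name", "<unnamed>")] = absent
    return missing


if __name__ == "__main__":
    rules = [
        {
            "name": "CheckoutLatencyAboveSLO",
            "annotations": {
                "threshold": "1s",
                "dashboard_url": "https://grafana.internal/d/checkout?from=now-30m",
                "runbook_url": "https://wiki.internal/runbooks/checkout-latency",
            },
        },
        {"name": "HighErrorRate", "annotations": {"threshold": "5%"}},
    ]
    for name, absent in missing_annotations(rules).items():
        print(f"{name}: missing {', '.join(absent)}")
```

Wired into CI, a non-empty result would fail the build, so an alert without a runbook or dashboard link never reaches production.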

Grouping and deduplication

One more thing that gets missed: when multiple alerts fire from the same incident, notification spam compounds the orientation problem. The on-call gets six pages in two minutes, each about a different symptom of the same underlying cause.

Good alerting systems let you group related alerts into a single notification with a summary. If your tool supports it, group by service or by incident. The goal is one notification that says "5 alerts fired across the checkout stack" with the top-level symptom, not five separate pages.

If your system doesn't support grouping natively, you can fake it by routing all checkout alerts to a single PagerDuty service and using its deduplication window. It's not perfect, but it prevents the worst of the pile-on.
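The grouping idea can be sketched in a few lines. The example below assumes each alert is a dictionary with `service`, `title`, and `fired_at` (epoch seconds) keys, and it treats the earliest alert in a group as the top-level symptom; both the data shape and that heuristic are assumptions, and a real system would also bound the group by a time window.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict]) -> list[str]:
    """Collapse alerts into one summary line per service."""
    by_service = defaultdict(list)
    for alert in alerts:
        by_service[alert["service"]].append(alert)

    summaries = []
    for service, group in by_service.items():
        # Heuristic: the first alert to fire is shown as the top-level symptom.
        first = min(group, key=lambda a: a["fired_at"])
        if len(group) == 1:
            summaries.append(f"{first['title']} ({service})")
        else:
            summaries.append(
                f"{len(group)} alerts fired across {service}; first symptom: {first['title']}"
            )
    return summaries


if __name__ == "__main__":
    alerts = [
        {"service": "checkout", "title": "Latency above SLO", "fired_at": 100},
        {"service": "checkout", "title": "Error rate elevated", "fired_at": 130},
        {"service": "search", "title": "Index lag high", "fired_at": 110},
    ]
    for line in group_alerts(alerts):
        print(line)
```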

Making it a team habit

Rewriting alert notifications only works if the team treats them as real content, not configuration boilerplate.

A few things that help:

  • Review alert bodies in code review. If alert definitions live in Terraform or Jsonnet, treat the body as seriously as the threshold.
  • Check notifications during game days. When you run incident simulations, have someone evaluate whether the notification gave them what they needed.
  • Rotate "alert hygiene" into your postmortem actions. When an incident drags because the on-call was confused about scope, make updating the alert body part of the remediation.

The friction of a bad alert notification compounds across every page. Fix them once and the improvement shows up every single time that alert fires.
