
How to Write Alert Notifications That Speed Up Triage
Your alert fires at 2:47 AM. The on-call engineer unlocks their phone, squints at the notification, and reads:
CRITICAL: high_error_rate Threshold exceeded
That's it. No service name, no affected region, no link to a dashboard, no hint of what "high" means or what "error rate" is being measured. The engineer spends the next four minutes just figuring out what they're looking at before they can even start fixing it.
This is one of the most common failure modes in on-call programs, and it has nothing to do with alert volume. It's a writing problem.
What most alert notifications get wrong
Alert conditions are usually well-thought-out. Someone spent time picking the right metric, setting a reasonable threshold, and wiring it to the right team. But the notification body, the thing a half-asleep engineer actually reads, gets maybe 30 seconds of attention.
The result is alerts that:
- Use the metric name as the title without explaining what it means operationally
- Include a number with no context ("latency_p99 = 2847ms" — is that bad? compared to what?)
- Don't link to anything useful
- Omit which environment, region, or customer is affected
- Repeat the title in the body word-for-word
Each of these forces the engineer to context-switch before they can act. They open a browser, find the right Grafana dashboard, navigate to the right service, and figure out the timeframe. By the time they're oriented, five to ten minutes can easily have passed.
The anatomy of a useful alert notification
Think of an alert notification as a first responder briefing. It should answer: what broke, how bad is it, who's affected, and where do I look first?
A title that describes the symptom, not the metric
Bad: api_request_duration_seconds_p99 > 2
Good: API latency above SLO — checkout service
The title is often the only thing visible in a phone notification preview. Make it tell the story in eight words or fewer. Use the service name and a human-readable description of the problem.
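One way to keep titles honest is a small lint check that runs wherever alert definitions are reviewed. The sketch below is only an illustration: the function name title_problems, the eight-word limit as a constant, and the snake_case heuristic for spotting raw metric names are all assumptions, not features of any alerting tool.

```python
import re

MAX_TITLE_WORDS = 8
# Heuristic (an assumption): a snake_case token usually means a raw metric name.
METRIC_LIKE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)+")


def title_problems(title: str) -> list[str]:
    """Return a list of reasons this title may be hard to read on a phone."""
    problems = []
    # Count only tokens that contain a letter or digit, so separators are ignored.
    words = [w for w in title.split() if any(c.isalnum() for c in w)]
    if len(words) > MAX_TITLE_WORDS:
        problems.append(f"{len(words)} words, limit is {MAX_TITLE_WORDS}")
    if METRIC_LIKE.search(title):
        problems.append("contains a raw metric name")
    return problems


print(title_problems("api_request_duration_seconds_p99 > 2"))
print(title_problems("API latency above SLO: checkout service"))
```

An empty list means the title passed both checks. The heuristic will miss metric names written in other styles, so treat it as a prompt for the reviewer rather than a gate.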
Severity with context
Don't just say CRITICAL. Include why:
Severity: CRITICAL
Reason: P99 latency at 4.2s (SLO threshold: 1s) — burning error budget at 12x normal rate
A number without a reference point means nothing at 3 AM. Always show the threshold alongside the current value.
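As a rough sketch of that rule, the helper below refuses to render a value without its threshold and adds how far over the threshold the value is. The name format_reason and its parameters are hypothetical; a real alerting tool would do this in its own template language.

```python
def format_reason(metric: str, current: float, threshold: float, unit: str = "s") -> str:
    """Render the current value next to its threshold, plus the overshoot ratio."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    ratio = current / threshold
    return (
        f"{metric} at {current:g}{unit} "
        f"(SLO threshold: {threshold:g}{unit}), {ratio:.1f}x the threshold"
    )


print(format_reason("P99 latency", 4.2, 1.0))
```

Making the threshold a required argument means nobody can produce a bare number by accident.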
Affected scope
The fastest path to diagnosis often starts with "is this everywhere or one region?" Put that in the notification:
Affected: us-east-1 (eu-west-1 nominal)
If you know which customers or which percentage of traffic is affected, include it. That single line changes the urgency calculus entirely.
One-click access to context
Include links directly in the notification body. At minimum:
- Dashboard scoped to the alert's timeframe
- Runbook for this alert
- Recent deployments (or a link to your deploy log)
Most alerting tools support this with template variables. A link like https://grafana.internal/d/checkout?from=now-30m that opens pre-scoped to the last 30 minutes saves a minute or two of navigation each time the alert fires.
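A relative range such as now-30m drifts if the link is opened an hour after the page. A variant worth considering is pinning the window to the moment the alert fired, using Grafana's support for epoch-millisecond from and to parameters. The sketch below assumes a hypothetical helper named scoped_dashboard_url and a 30-minute lookback; neither comes from a specific tool.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def scoped_dashboard_url(base_url: str, fired_at: datetime, lookback_minutes: int = 30) -> str:
    """Build a dashboard link whose time range ends when the alert fired.

    fired_at should be timezone-aware; a naive datetime would be read as local time.
    """
    start = fired_at - timedelta(minutes=lookback_minutes)
    params = {
        "from": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(fired_at.timestamp() * 1000),
    }
    return f"{base_url}?{urlencode(params)}"


fired = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
print(scoped_dashboard_url("https://grafana.internal/d/checkout", fired))
```

The trade-off is that a pinned window does not show what happened after the alert fired, so some teams may prefer to extend the end of the range by a few minutes.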
Recent changes
A large fraction of incidents are caused by something that just changed: a deploy, a config push, a feature flag rollout. If your alerting system can pull in recent deploys for the affected service, do it. Even a simple "last deploy: 14 minutes ago (v2.4.1 by @jsmith)" gives the engineer a strong lead.
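If your deploy log exposes a timestamp, version, and author, turning them into that one-line lead takes very little code. This is a minimal sketch: last_deploy_line and its arguments are hypothetical, and fetching the deploy record from your own system is left out.

```python
from datetime import datetime, timezone


def last_deploy_line(deployed_at: datetime, now: datetime, version: str, author: str) -> str:
    """Describe the most recent deploy relative to the time the alert fired."""
    minutes = max(0, int((now - deployed_at).total_seconds() // 60))
    if minutes < 60:
        unit = "minute" if minutes == 1 else "minutes"
        age = f"{minutes} {unit} ago"
    else:
        age = f"{minutes // 60}h {minutes % 60}m ago"
    return f"last deploy: {age} ({version} by @{author})"


deployed = datetime(2024, 1, 1, 2, 33, tzinfo=timezone.utc)
fired = datetime(2024, 1, 1, 2, 47, tzinfo=timezone.utc)
print(last_deploy_line(deployed, fired, "v2.4.1", "jsmith"))
```

Showing the age relative to the alert, rather than a raw timestamp, spares the reader the arithmetic at 3 AM.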
A before and after
Here's the same alert rewritten.
Before:
[CRITICAL] checkout_service_latency_p99_alert
checkout_service_latency_p99 = 4248
After:
[CRITICAL] Checkout service latency above SLO — us-east-1
P99 latency: 4,248ms (SLO: 1,000ms)
Affected: us-east-1 | eu-west-1 nominal
Error budget burn rate: 12x
Last deploy: 18 min ago — v2.4.1 (feature: new payment provider)
Dashboard: https://grafana.internal/d/checkout?from=now-30m
Runbook: https://wiki.internal/runbooks/checkout-latency
The second version takes maybe 30 extra seconds to read, but it replaces five minutes of orienting work with a starting point: the engineer sees the deploy timestamp and already has a hypothesis.
Templates to enforce consistency
The problem with good alert bodies is that they require whoever writes the alert to think carefully about what context to include. That doesn't scale.
Instead, build a standard template in your alerting system and require it for all production alerts. Most tools (PagerDuty, NearIRM, Grafana OnCall) support notification templates with variable interpolation. A template might look like:
{{ .AlertName }} — {{ .Labels.service }}
Current: {{ .Value }}{{ .Labels.unit }} | Threshold: {{ .Annotations.threshold }}
Affected: {{ .Labels.region }}
Severity: {{ .Labels.severity }}
{{ if .Annotations.last_deploy }}Last deploy: {{ .Annotations.last_deploy }}{{ end }}
Dashboard: {{ .Annotations.dashboard_url }}
Runbook: {{ .Annotations.runbook_url }}
Because the template expects runbook_url and dashboard_url as alert annotations, they effectively become required fields: an alert added without them produces a visibly incomplete notification. Filled in consistently, they let the on-call spend less time in triage and more time actually fixing things.
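If your tool does not enforce required fields itself, you can enforce them at render time or in CI. The sketch below uses Python's string.Template as a stand-in for your alerting tool's template language, since its substitute method raises an error on any missing key. The field names mirror the template above; render_notification is a hypothetical helper.

```python
from string import Template

# Stand-in for the alerting tool's template; every $name is a required field.
NOTIFICATION = Template(
    "$alert_name ($service)\n"
    "Current: $value | Threshold: $threshold\n"
    "Affected: $region\n"
    "Severity: $severity\n"
    "Dashboard: $dashboard_url\n"
    "Runbook: $runbook_url"
)


def render_notification(fields: dict) -> str:
    """Render the notification, refusing to proceed if any field is missing."""
    try:
        return NOTIFICATION.substitute(fields)
    except KeyError as missing:
        raise ValueError(f"alert is missing required field {missing}") from None


example = {
    "alert_name": "Checkout service latency above SLO",
    "service": "checkout",
    "value": "4,248ms",
    "threshold": "1,000ms",
    "region": "us-east-1",
    "severity": "CRITICAL",
    "dashboard_url": "https://grafana.internal/d/checkout?from=now-30m",
    "runbook_url": "https://wiki.internal/runbooks/checkout-latency",
}
print(render_notification(example))
```

Failing loudly when a field is absent moves the discovery from the 3 AM page to the moment the alert is defined.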
If you're auditing existing alerts, filter for any notification where the body and title are identical, or where neither runbook_url nor dashboard_url is set. Those are your worst offenders.
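That audit can be scripted once alert definitions are exported to something machine-readable. The sketch below assumes each alert is a dict with title, body, and annotations keys, which is a hypothetical schema; adapt the field access to whatever your tool exports.

```python
LINK_ANNOTATIONS = ("runbook_url", "dashboard_url")


def audit_alert(alert: dict) -> list[str]:
    """Return the reasons this alert counts as a worst offender, if any."""
    problems = []
    title = alert.get("title", "").strip().lower()
    body = alert.get("body", "").strip().lower()
    if title == body:
        problems.append("body repeats title")
    annotations = alert.get("annotations", {})
    # Flag only when neither link is set, matching the audit rule above.
    if not any(annotations.get(key) for key in LINK_ANNOTATIONS):
        problems.append("no runbook_url or dashboard_url")
    return problems


alerts = [
    {"title": "checkout_service_latency_p99_alert",
     "body": "checkout_service_latency_p99_alert",
     "annotations": {}},
    {"title": "Checkout service latency above SLO",
     "body": "P99 latency: 4,248ms (SLO: 1,000ms)",
     "annotations": {"runbook_url": "https://wiki.internal/runbooks/checkout-latency"}},
]
for alert in alerts:
    print(alert["title"], "->", audit_alert(alert))
```

Sorting the results by how often each alert fires would tell you which rewrites pay off first.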
Grouping and deduplication
One more thing that gets missed: when multiple alerts fire from the same incident, notification spam compounds the orientation problem. The on-call gets five pages in two minutes, each about a different symptom of the same underlying cause.
Good alerting systems let you group related alerts into a single notification with a summary. If your tool supports it, group by service or by incident. The goal is one notification that says "5 alerts fired across the checkout stack" with the top-level symptom, not five separate pages.
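To make the idea concrete, here is a minimal sketch of grouping by service and surfacing the most severe symptom. The alert dicts, the severity ranking, and the summarize_by_service name are all assumptions for illustration; real tools implement grouping in their own routing layer.

```python
from collections import defaultdict

# Lower rank means more severe; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def summarize_by_service(alerts: list[dict]) -> dict[str, str]:
    """Collapse alerts into one summary line per service."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["service"]].append(alert)

    summaries = {}
    for service, items in grouped.items():
        # The top-level symptom is the first alert with the most severe rank.
        top = min(items, key=lambda a: SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)))
        noun = "alert" if len(items) == 1 else "alerts"
        summaries[service] = f"{len(items)} {noun} fired for {service}; top symptom: {top['title']}"
    return summaries


firing = [
    {"service": "checkout", "severity": "warning", "title": "Queue depth rising"},
    {"service": "checkout", "severity": "critical", "title": "Checkout latency above SLO"},
    {"service": "search", "severity": "info", "title": "Cache hit rate low"},
]
for line in summarize_by_service(firing).values():
    print(line)
```

Grouping by service is the simplest key; grouping by incident needs some way to correlate alerts, which this sketch does not attempt.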
If your system doesn't support grouping natively, you can approximate it by routing all checkout alerts to a single PagerDuty service and sending them with a shared dedup_key, so repeat events fold into one open alert instead of paging again. It's not perfect, but it prevents the worst of the pile-on.
Making it a team habit
Rewriting alert notifications only works if the team treats them as real content, not configuration boilerplate.
A few things that help:
- Review alert bodies in code review. If alert definitions live in Terraform or Jsonnet, treat the body as seriously as the threshold.
- Check notifications during game days. When you run incident simulations, have someone evaluate whether the notification gave them what they needed.
- Fold "alert hygiene" into your postmortem actions. When an incident drags because the on-call was confused about scope, make updating the alert body part of the remediation.
The friction of a bad alert notification compounds across every page. Fix them once and the improvement shows up every single time that alert fires.