
# Synthetic Monitoring: Catch Incidents Before Your Users Do
Most alerting is reactive. A service throws errors, metrics spike, and someone gets paged. By the time an alert fires, real users have already hit the problem.
Synthetic monitoring flips that around. Instead of waiting for something to break, you simulate user behavior on a schedule and alert when those simulated requests fail. You become the canary.
## What synthetic monitoring actually is
A synthetic monitor runs a scripted check against your system at regular intervals, from one or more locations. It could be as simple as an HTTP GET to your homepage, or as involved as a multi-step browser session that logs in, adds an item to a cart, and checks out.
The key word is "scripted." The checks are deterministic and repeatable. You define what success looks like, and if the check fails, you know something changed.
This is different from real-user monitoring (RUM), which instruments actual user sessions. RUM gives you a picture of what users experience in aggregate; synthetics give you an immediate, continuous signal from a known baseline.
Neither replaces the other. Synthetics catch outages fast; RUM tells you how bad things actually were for users.
## Types of checks worth running
Uptime checks are the simplest: ping a URL, expect a 200. They're cheap to run and catch obvious availability problems. Don't rely on them alone.
API checks go a level deeper. You send a request with a real payload and assert on the response body, headers, and latency. If your /api/auth/token endpoint starts returning 401s for valid credentials, an uptime check on your homepage won't catch it.
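At its core, an API check is a scripted request plus a set of assertions on the response. A minimal sketch in Python, using the standard library's `urllib`; the endpoint, payload, and thresholds here are placeholders, not any particular tool's API:

```python
import json
import time
import urllib.error
import urllib.request

def evaluate(status: int, latency_ms: float, expect_status: int = 200,
             max_latency_ms: float = 1000.0) -> list:
    """Return a list of assertion failures (empty list means the check passed)."""
    failures = []
    if status != expect_status:
        failures.append(f"status {status} != {expect_status}")
    if latency_ms > max_latency_ms:
        failures.append(f"latency {latency_ms:.0f}ms > {max_latency_ms:.0f}ms")
    return failures

def run_check(url: str, payload: dict, **expectations) -> list:
    """Send one synthetic POST request and evaluate status and latency."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code  # urllib raises on non-2xx; a probe still wants the code
    return evaluate(status, (time.monotonic() - start) * 1000, **expectations)
```

Separating `evaluate` from the HTTP call keeps the pass/fail logic testable without network access, which matters once you treat check code like production code.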
Browser checks (sometimes called end-to-end synthetics) run a headless browser through a scripted user flow. These are slower and more fragile, but they're the only way to catch issues in client-side rendering, JavaScript execution, or third-party widget loading.
Multi-step API flows fall between API checks and full browser automation. You chain a series of API calls, passing state between them, to simulate a workflow without a browser. A checkout flow might look like:
```yaml
steps:
  - name: Create cart
    method: POST
    url: https://api.example.com/carts
    extract:
      cart_id: $.id
  - name: Add item
    method: POST
    url: https://api.example.com/carts/{{cart_id}}/items
    body:
      product_id: "sku-42"
      quantity: 1
    assert:
      status: 200
  - name: Start checkout
    method: POST
    url: https://api.example.com/checkout
    body:
      cart_id: "{{cart_id}}"
    assert:
      status: 200
      body.status: pending
```
If any step fails, you know exactly where the workflow broke.
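The chaining mechanic (extract a value from one response, template it into the next request) can be sketched as a small step runner. The `transport` callable is injected so the runner can be driven by real HTTP or a fake in tests, and extraction is simplified to top-level keys rather than full JSONPath; both are assumptions of this sketch:

```python
import re

def run_flow(steps: list, transport) -> tuple:
    """Run chained steps, passing extracted state into later URLs and bodies.

    `transport(method, url, body)` returns (status, parsed_body).
    Stops at the first step whose status does not match, so the caller
    knows exactly where the workflow broke.
    """
    state = {}

    def render(text: str) -> str:
        # Substitute {{var}} placeholders with previously extracted values.
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(state[m.group(1)]), text)

    completed = []
    for step in steps:
        body = {k: render(v) if isinstance(v, str) else v
                for k, v in step.get("body", {}).items()}
        status, resp = transport(step["method"], render(step["url"]), body)
        if status != step.get("expect_status", 200):
            return completed, state  # caller alerts on the first step not listed
        for var, key in step.get("extract", {}).items():
            state[var] = resp[key]  # simplified: top-level keys, not JSONPath
        completed.append(step["name"])
    return completed, state
```

Because the runner returns the list of completed steps, a failure report can name the exact step that broke, mirroring the YAML flow above.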
## What to monitor (and what to skip)
Start with your critical user paths: login, signup, checkout, core API endpoints, anything that directly generates revenue or is in your SLAs.
A decent starting checklist:
| Check type | Target | Interval |
|---|---|---|
| Uptime | Homepage, docs, status page | 1 min |
| API | Auth endpoints, core CRUD | 1 min |
| Multi-step | Signup flow, checkout flow | 5 min |
| Browser | Main user journey | 10 min |
Don't monitor everything. Adding checks for every internal endpoint creates noise without meaningfully expanding coverage of what users experience. Focus on what a user would actually do.
Run checks from multiple regions if you have a geographically distributed user base. An outage in us-east-1 won't show up if your synthetic checks only run from eu-west-1.
## Setting thresholds that don't cry wolf
The default alert condition in most synthetic tools is binary: the check passed or it failed. That works for availability, but latency degradations are subtler.
Set a latency threshold alongside your availability check. If your checkout API normally responds in 200ms and it starts taking 2 seconds, that's not an outage by uptime definition, but it's a bad user experience and often a leading indicator of a bigger problem.
A reasonable starting point: alert when latency exceeds 3x your p95 baseline over a 5-minute window. Adjust from there based on what you actually observe.
Avoid alerting on a single check failure. One failed request might be a blip in the monitoring infrastructure itself. Alert after two or three consecutive failures, or after a failure rate threshold (e.g., 2 of 3 checks failed in the last 3 minutes).
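Both rules, the 3x-p95 latency threshold and the consecutive-failure requirement, can be encoded in one small stateful alert rule. A sketch; the defaults are the starting points suggested above, meant to be tuned against what you observe:

```python
from collections import deque

class AlertRule:
    """Decide when a stream of synthetic check results warrants a page.

    Fires only after `consecutive` straight failures. A check that succeeds
    but exceeds `latency_factor` x the p95 baseline counts as a failure,
    treating slowness as a leading indicator rather than a separate signal.
    """

    def __init__(self, p95_baseline_ms: float, consecutive: int = 3,
                 latency_factor: float = 3.0):
        self.threshold_ms = p95_baseline_ms * latency_factor
        self.consecutive = consecutive
        self.recent = deque(maxlen=consecutive)

    def observe(self, ok: bool, latency_ms: float) -> bool:
        """Record one check result; return True when an alert should fire."""
        failed = (not ok) or latency_ms > self.threshold_ms
        self.recent.append(failed)
        return len(self.recent) == self.consecutive and all(self.recent)
```

A single blip never pages anyone, because the window must fill with failures first; the failure-rate variant (2 of 3 in the last 3 minutes) would swap `all(...)` for a count over the same window.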
## Plugging synthetics into your incident workflow
A synthetic check failing is an incident trigger, same as any other alert. It should route into your on-call system with enough context to act on immediately.
The alert notification should include:
- Which check failed (name, URL, step)
- What the failure was (status code, assertion that failed, error message)
- How long it's been failing
- A link to the check run with full request/response details
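As one hypothetical shape for that notification (the field names and the run URL are invented for illustration, not any specific on-call tool's schema):

```python
def build_alert(check: dict, failure: dict, failing_for_s: int) -> dict:
    """Assemble an on-call notification with the four pieces of context above."""
    return {
        "title": f"[synthetic] {check['name']} failing",
        # Which check failed (name, URL, step)
        "check": {"name": check["name"], "url": check["url"],
                  "step": failure.get("step")},
        # What the failure was
        "failure": {"status": failure.get("status"),
                    "assertion": failure.get("assertion"),
                    "error": failure.get("error")},
        # How long it's been failing
        "failing_for_s": failing_for_s,
        # Link to the full request/response details (hypothetical URL scheme)
        "run_url": f"https://monitoring.example.com/runs/{failure['run_id']}",
    }
```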
If your synthetic check fires at 3am, the on-call engineer needs to know within 30 seconds whether they're looking at a full site down or a broken checkout flow on one specific region. Vague alerts waste time.
Tie synthetic failures to your severity definitions. A homepage uptime check failing is probably a SEV-1. An end-to-end browser check for a secondary feature failing at 2am might be a SEV-3 that waits for business hours.
## False positives and keeping checks healthy
Synthetics break for reasons that have nothing to do with your production systems. The monitoring provider has an outage. A network route flaps. A browser check starts failing because a third-party script changed its DOM structure.
Some things that help:
- Use multiple locations and require agreement. Alert only when 2+ regions see the same failure. This filters out single-region probe issues.
- Treat check code like production code. Review changes to synthetic scripts. If a check starts failing after a frontend deploy, it might be your check that needs updating, not your app.
- Track false positive rate. If a check fires and the on-call engineer closes it without incident, log that. High false positive rates from a specific check mean the check is misconfigured or monitoring something too fragile.
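The multi-region agreement rule in the first bullet reduces to a small quorum function. A sketch, assuming each probe region reports a boolean result for the same check:

```python
def should_alert(region_results: dict, quorum: int = 2) -> bool:
    """Alert only when at least `quorum` regions report the same check failing.

    A single failing region is treated as a probable probe-side issue
    (provider outage, flapping network route) rather than a real incident.
    """
    failing = [region for region, ok in region_results.items() if not ok]
    return len(failing) >= quorum
```

The trade-off is a slightly longer detection window for genuinely regional outages, which is usually worth it for the reduction in 3am false pages.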
Over time you'll tune checks so that when they fire, something is actually wrong. That trust is what makes synthetic monitoring useful at 2am.
## Tooling landscape
Most observability platforms include synthetic monitoring now. Datadog Synthetics, Checkly, Grafana's synthetic monitoring plugin, New Relic Synthetics, and Pingdom are common choices. Smaller, focused options like Better Uptime or Uptime Robot work well for pure uptime monitoring.
If you're running Prometheus and Grafana already, the Grafana synthetic monitoring plugin adds basic checks without pulling in another vendor.
For teams that want full control, you can run your own probes with tools like Playwright (for browser checks) or k6 (for API and multi-step checks) on a schedule, pushing results to your metrics system.
The right choice depends on how complex your checks need to be. For uptime and basic API monitoring, any hosted solution works. For multi-step browser flows with assertions on specific UI elements, you'll want something with solid scripting support.
## The goal: fewer surprises
Synthetic monitoring doesn't prevent incidents. It doesn't make your systems more reliable. What it does is shorten the gap between "something broke" and "someone knows something broke."
For most teams, the hardest part of incident response isn't fixing the problem. It's finding out the problem exists in the first place. Reducing that detection window, even by a few minutes, means fewer users affected and less time spent in reactive fire-fighting mode.
Start with one critical user path. Get the alerting right. Build from there.