NearIRM Team · 6 min read

Synthetic Monitoring: Catch Incidents Before Your Users Do

Most alerting is reactive. A service throws errors, metrics spike, and someone gets paged. By the time an alert fires, real users have already hit the problem.

Synthetic monitoring flips that around. Instead of waiting for something to break, you simulate user behavior on a schedule and alert when those simulated requests fail. You become the canary.

What synthetic monitoring actually is

A synthetic monitor runs a scripted check against your system at regular intervals, from one or more locations. It could be as simple as an HTTP GET to your homepage, or as involved as a multi-step browser session that logs in, adds an item to a cart, and checks out.

The key word is "scripted." The checks are deterministic and repeatable. You define what success looks like, and if the check fails, you know something changed.

This is different from real-user monitoring (RUM), which instruments actual user sessions. RUM gives you a picture of what users experience in aggregate; synthetics give you an immediate, continuous signal from a known baseline.

Neither replaces the other. Synthetics catch outages fast; RUM tells you how bad things actually were for users.

Types of checks worth running

Uptime checks are the simplest: ping a URL, expect a 200. They're cheap to run and catch obvious availability problems. Don't rely on them alone.

API checks go a level deeper. You send a request with a real payload and assert on the response body, headers, and latency. If your /api/auth/token endpoint starts returning 401s for valid credentials, an uptime check on your homepage won't catch it.
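
The assertion side of an API check reduces to comparing a response against expectations. A minimal sketch, assuming a hypothetical `evaluate_api_check` helper (not any particular tool's API):

```python
def evaluate_api_check(status, latency_ms, body, *, expect_status=200,
                       max_latency_ms=1000, expect_fields=None):
    """Return a list of assertion failures; an empty list means the check passed.

    `expect_fields` maps top-level JSON keys to expected values.
    All names and thresholds here are illustrative defaults.
    """
    failures = []
    if status != expect_status:
        failures.append(f"status: got {status}, want {expect_status}")
    if latency_ms > max_latency_ms:
        failures.append(f"latency: {latency_ms:.0f}ms > {max_latency_ms}ms")
    for key, want in (expect_fields or {}).items():
        got = body.get(key)
        if got != want:
            failures.append(f"body.{key}: got {got!r}, want {want!r}")
    return failures
```

A failing auth endpoint shows up immediately: `evaluate_api_check(401, 150, {"token": None}, expect_fields={"token": "abc"})` reports both the wrong status and the missing body field, while the homepage uptime check stays green.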

Browser checks (sometimes called end-to-end synthetics) run a headless browser through a scripted user flow. These are slower and more fragile, but they're the only way to catch issues in client-side rendering, JavaScript execution, or third-party widget loading.

Multi-step API flows fall between API checks and full browser automation. You chain a series of API calls, passing state between them, to simulate a workflow without a browser. A checkout flow might look like:

steps:
  - name: Create cart
    method: POST
    url: https://api.example.com/carts
    extract:
      cart_id: $.id

  - name: Add item
    method: POST
    url: https://api.example.com/carts/{{cart_id}}/items
    body:
      product_id: "sku-42"
      quantity: 1
    assert:
      status: 200

  - name: Start checkout
    method: POST
    url: https://api.example.com/checkout
    body:
      cart_id: "{{cart_id}}"
    assert:
      status: 200
      body.status: pending

If any step fails, you know exactly where the workflow broke.
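
The mechanics behind the config above — `{{cart_id}}` substitution and `$.id` extraction — are a few lines of string handling. A sketch of the idea, not the implementation of any specific tool:

```python
import re

def render(template, variables):
    """Replace {{name}} placeholders with values captured by earlier steps."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(variables[m.group(1)]), template)

def extract(path, body):
    """Resolve a simple dotted path like $.id or $.cart.id against a dict.

    Real tools support fuller JSONPath; this handles only the dotted subset.
    """
    value = body
    for key in path.lstrip("$.").split("."):
        value = value[key]
    return value
```

Each step's `extract` results feed the next step's `render`, which is what lets the "Add item" step target the cart created one step earlier.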

What to monitor (and what to skip)

Start with your critical user paths: login, signup, checkout, core API endpoints, anything that directly generates revenue or is in your SLAs.

A decent starting checklist:

Check type   Target                        Interval
Uptime       Homepage, docs, status page   1 min
API          Auth endpoints, core CRUD     1 min
Multi-step   Signup flow, checkout flow    5 min
Browser      Main user journey             10 min

Don't monitor everything. Adding checks for every internal endpoint creates noise without adding coverage. Focus on what a user would actually do.

Run checks from multiple regions if you have a geographically distributed user base. An outage in us-east-1 won't show up if your synthetic checks only run from eu-west-1.

Setting thresholds that don't cry wolf

The default threshold for most synthetic tools is binary: the check passed or failed. That works for availability, but latency degradations are subtler.

Set a latency threshold alongside your availability check. If your checkout API normally responds in 200ms and it starts taking 2 seconds, that's not an outage by uptime definition, but it's a bad user experience and often a leading indicator of a bigger problem.

A reasonable starting point: alert when latency exceeds 3x your p95 baseline over a 5-minute window. Adjust from there based on what you actually observe.
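
That rule translates to very little code. A sketch, with illustrative function names, treating "exceeds over a 5-minute window" as every sample in the window being over the threshold (a sustained degradation, not a single slow request):

```python
import statistics

def p95(samples_ms):
    """95th-percentile latency of a list of samples, in milliseconds."""
    return statistics.quantiles(samples_ms, n=100)[94]

def latency_alert(window_ms, baseline_p95_ms, factor=3.0):
    """Fire when every sample in the window exceeds factor * baseline p95."""
    threshold = factor * baseline_p95_ms
    return all(s > threshold for s in window_ms)
```

Requiring the whole window to breach the threshold is one defensible choice; alerting on the window's median or p95 instead trades sensitivity for noise in the other direction.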

Plugging synthetics into your incident workflow

A synthetic check failing is an incident trigger, same as any other alert. It should route into your on-call system with enough context to act on immediately.

The alert notification should include:

  • Which check failed (name, URL, step)
  • What the failure was (status code, assertion that failed, error message)
  • How long it's been failing
  • A link to the check run with full request/response details
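
A payload carrying those four fields might look like this; `build_alert` and its field names are hypothetical, not a real provider's webhook format:

```python
import datetime

def build_alert(check_name, url, step, failure, first_failure_at, run_url):
    """Assemble the context an on-call engineer needs at a glance."""
    failing_for = datetime.datetime.now(datetime.timezone.utc) - first_failure_at
    return {
        "check": check_name,
        "url": url,
        "step": step,                  # which step of a multi-step check broke
        "failure": failure,            # status code / assertion / error message
        "failing_for_minutes": round(failing_for.total_seconds() / 60),
        "run_details": run_url,        # link to the full request/response dump
    }
```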

If your synthetic check fires at 3am, the on-call engineer needs to know within 30 seconds whether they're looking at a full site down or a broken checkout flow on one specific region. Vague alerts waste time.

Tie synthetic failures to your severity definitions. A homepage uptime check failing is probably a SEV-1. An end-to-end browser check for a secondary feature failing at 2am might be a SEV-3 that waits for business hours.
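
That mapping can live in routing code rather than in someone's head. A sketch of one possible policy — the tags, hours, and levels are all assumptions to adapt to your own severity definitions:

```python
BUSINESS_HOURS = range(9, 18)  # 09:00-17:59 local; illustrative cutoff

def severity(check_tag, hour):
    """Route a failing check to a severity level (illustrative policy)."""
    if check_tag == "homepage-uptime":
        return "SEV-1"          # full-site signal: always page
    if check_tag.startswith("secondary-") and hour not in BUSINESS_HOURS:
        return "SEV-3"          # waits for business hours
    return "SEV-2"
```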

False positives and keeping checks healthy

Synthetics break for reasons that have nothing to do with your production systems. The monitoring provider has an outage. A network route flaps. A browser check starts failing because a third-party script changed its DOM structure.

Some things that help:

  • Use multiple locations and require agreement. Alert only when 2+ regions see the same failure. This filters out single-region probe issues.
  • Treat check code like production code. Review changes to synthetic scripts. If a check starts failing after a frontend deploy, it might be your check that needs updating, not your app.
  • Track false positive rate. If a check fires and the on-call engineer closes it without incident, log that. High false positive rates from a specific check mean the check is misconfigured or monitoring something too fragile.
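
The multi-location agreement rule from the first bullet reduces to a counting check; `should_alert` is an illustrative name:

```python
def should_alert(region_results, quorum=2):
    """Alert only when at least `quorum` regions report a failure.

    `region_results` maps region name -> True (check passed) / False (failed).
    """
    failed = [region for region, passed in region_results.items() if not passed]
    return len(failed) >= quorum
```

A single probe failing in one region stays quiet; the same failure seen from two regions pages someone.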

Over time you'll tune checks so that when they fire, something is actually wrong. That trust is what makes synthetic monitoring useful at 2am.

Tooling landscape

Most observability platforms include synthetic monitoring now. Datadog Synthetics, Checkly, Grafana's synthetic monitoring plugin, New Relic Synthetics, and Pingdom are common choices. Smaller, focused options like Better Uptime or Uptime Robot work well for pure uptime monitoring.

If you're running Prometheus and Grafana already, the Grafana synthetic monitoring plugin adds basic checks without pulling in another vendor.

For teams that want full control, you can run your own probes with tools like Playwright (for browser checks) or k6 (for API and multi-step checks) on a schedule, pushing results to your metrics system.

The right choice depends on how complex your checks need to be. For uptime and basic API monitoring, any hosted solution works. For multi-step browser flows with assertions on specific UI elements, you'll want something with solid scripting support.

The goal: fewer surprises

Synthetic monitoring doesn't prevent incidents. It doesn't make your systems more reliable. What it does is shorten the gap between "something broke" and "someone knows something broke."

For most teams, the hardest part of incident response isn't fixing the problem. It's finding out the problem exists in the first place. Reducing that detection window, even by a few minutes, means fewer users affected and less time spent in reactive fire-fighting mode.

Start with one critical user path. Get the alerting right. Build from there.
