
Using Distributed Tracing to Cut Incident Investigation Time
When a P1 hits, you have a few minutes before Slack explodes with stakeholder questions. Your metrics dashboard shows latency is up across three services. Your logs are a wall of text. You know something is wrong, but not where.
That gap between "something is broken" and "here's the culprit" is where incidents stretch from minutes into hours. Distributed tracing is one of the most effective tools for closing that gap, but a lot of teams treat it as a debugging tool for developers rather than an operations tool for on-call engineers. That's a mistake.
What tracing actually gives you
Metrics tell you a service is slow. Logs tell you what happened on a specific machine. Traces tell you what happened to a specific request as it traveled through your entire system.
A trace follows one request from the moment it enters your system to the moment it returns. Every service the request touches adds a span to that trace: a timestamped record of what work it did, how long it took, and whether it succeeded. When you look at the complete trace, you see a waterfall diagram showing the full call chain.
During an incident, that waterfall is your map. Instead of asking "which of our 40 services is slow?", you can look at a few traces from the affected time window and see exactly where time was spent.
A trace also gives you a single trace_id that links everything together. That one ID can pull up related logs, the specific database query, the downstream HTTP call, and the background job that got triggered, all in one place. That beats trying to correlate events across services by timestamp.
Reading a trace during an incident
When you pull up traces for a degraded endpoint, look for three things:
Unusually long spans. A span that normally takes 5ms is now taking 800ms. That's your starting point. You don't need to know why yet, but you know where to look.
Error spans. Any span with an error status is worth examining. Errors can cascade: if a downstream dependency starts returning 500s, the caller might start retrying, which amplifies the load, which causes more errors.
Unexpected fan-out. If a service is making 200 downstream calls where it normally makes 1, you've likely found an N+1 problem. This pattern shows up clearly in the waterfall as a long column of identical short spans, and it's nearly invisible in metrics alone.
Here's a simplified example. Say your checkout service is timing out. Your metrics show database latency is high. But which query?
checkout-service [800ms]
├─ product-service [12ms]
└─ inventory-service [790ms] ← here
   └─ postgres query: SELECT ... [788ms]
Without tracing, you'd be scanning slow query logs hoping the timestamp lines up. With tracing, you're looking at the exact query that ran during the exact request that failed, linked directly from your incident management tool.
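The three checks above (long spans, error spans, fan-out) can also be run mechanically over exported span data. The following is a minimal sketch, not any backend's API: the `Span` fields, the `triage` function, the thresholds, and the baseline durations are all assumptions made for illustration. It compares each span's self time (its duration minus its direct children) against a per-operation baseline, which is what isolates the postgres query in the checkout example rather than flagging every parent above it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record; real backends export many more fields.
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float
    error: bool = False


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another. Parallel children would push the
    result below zero, so it is clamped at zero.
    """
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return max(0.0, span.duration_ms - children)


def triage(spans, baseline_ms, slow_factor=10.0, fan_out_limit=20):
    """Return the three incident signals for one trace."""
    # 1. Spans whose self time is far above their usual duration.
    #    Operations without a baseline are never flagged.
    slow = [
        s.name for s in spans
        if self_time_ms(s, spans) > slow_factor * baseline_ms.get(s.name, float("inf"))
    ]
    # 2. Spans that ended with an error status.
    errors = [s.name for s in spans if s.error]
    # 3. Many identically named children under one parent (possible N+1).
    calls = Counter((s.parent_id, s.name) for s in spans if s.parent_id is not None)
    fan_out = {name: n for (_, name), n in calls.items() if n > fan_out_limit}
    return {"slow": slow, "errors": errors, "fan_out": fan_out}


# The checkout trace from the example, with made-up baseline durations.
trace = [
    Span("a", None, "checkout-service", 800),
    Span("b", "a", "product-service", 12),
    Span("c", "a", "inventory-service", 790),
    Span("d", "c", "postgres query", 788),
]
baseline = {
    "checkout-service": 60,
    "product-service": 10,
    "inventory-service": 15,
    "postgres query": 5,
}
report = triage(trace, baseline)
print(report)  # report["slow"] is ["postgres query"]
```

Using self time instead of total duration is a design choice: checkout-service and inventory-service are both slow in total, but almost all of that time belongs to the query underneath them.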
Connecting traces to your alerting workflow
Tracing is most useful when it's part of your incident workflow, not a separate tool you remember to check after the fact.
Alert on span-level signals. Most tracing backends (Jaeger, Tempo, Honeycomb, Datadog APM, Lightstep) can generate metrics from trace data. You can alert on things like error rate for a specific operation name, or the P99 latency of a particular database call. This is more precise than service-level metrics and tends to generate fewer false positives.
Include trace links in alerts. When an alert fires, automatically include a link to the traces from the affected time window. Your alerting tool can pass the service name and timestamp to your tracing backend's search API. That link saves the first 2-3 minutes of an incident, where the on-call engineer is just figuring out where to look.
Add trace queries to runbooks. For common incident types, document the trace queries that help diagnose them. For a database saturation incident, the runbook might say: "Filter traces for db.system = postgresql and sort by duration descending. Look for unexpectedly frequent queries." That gives the on-call engineer a starting point they don't have to invent under pressure.
If your metrics and tracing backends support exemplars, turn them on. An exemplar attaches the trace ID of a sample request to a metric data point (for example, a Prometheus histogram bucket), so you can jump from a spike on a latency graph directly to a trace recorded during that spike.
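To make the runbook query for database saturation concrete, here is a rough sketch of the same filter applied to spans exported as dictionaries. The function name, the `duration_ms` field, and the `db.statement` key are assumptions about how your instrumentation records spans; in practice you would express this in your tracing backend's own query language.

```python
from collections import Counter


def postgres_report(spans, top_n=3):
    """Slowest PostgreSQL spans plus the most frequently run statements."""
    # Filter: db.system = postgresql
    db_spans = [s for s in spans if s.get("db.system") == "postgresql"]
    # Sort by duration descending and keep the top few.
    slowest = sorted(db_spans, key=lambda s: s["duration_ms"], reverse=True)[:top_n]
    # Count how often each statement appears to spot unexpectedly frequent queries.
    frequency = Counter(s["db.statement"] for s in db_spans).most_common(top_n)
    return slowest, frequency


# Made-up spans for illustration only.
spans = [
    {"db.system": "postgresql", "db.statement": "SELECT * FROM stock WHERE sku = $1", "duration_ms": 4},
    {"db.system": "postgresql", "db.statement": "SELECT * FROM stock WHERE sku = $1", "duration_ms": 6},
    {"db.system": "postgresql", "db.statement": "UPDATE orders SET status = $1", "duration_ms": 788},
    {"db.system": "redis", "db.statement": "GET cart:42", "duration_ms": 1},
]
slowest, frequency = postgres_report(spans)
print(slowest[0]["db.statement"], frequency[0])
```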
Sampling: the thing everyone gets wrong
Distributed tracing at 100% sampling is expensive. So most teams sample, meaning they only record a fraction of traces. This is fine, but the default sampling strategy matters a lot.
Head-based sampling (the most common) decides at the start of a request whether to record it. The problem: you can't know at the start whether the request will be interesting. Errors, slow requests, and outliers are rare, and they get sampled at the same low rate as everything else, so at a 1% sample rate you keep roughly 1 in 100 of the requests you most want to see.
Tail-based sampling decides after the request completes. If the request was slow or errored, record it. If it was normal, don't bother. This is harder to implement, but it means your trace storage is full of exactly the traces you care about during incidents.
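A middle ground that works for many teams: keep a 1-5% baseline of normal traffic, and keep 100% of errors and traces above a latency threshold. The second rule needs a decision made after the request completes, so it usually lives in a collector or backend rather than in the application's SDK. Many tracing backends support this with some configuration.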
If you're routinely unable to find traces for the requests that caused an incident, your sampling strategy is probably the culprit.
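The middle-ground policy described above can be sketched as a single keep-or-drop decision made once a trace has finished. This is an illustration of the logic, not a real collector's API: the function name, parameters, and default thresholds are assumptions. The baseline uses a hash of the trace ID rather than a random number so that every collector instance reaches the same decision for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, has_error: bool,
               baseline_rate: float = 0.05, slow_ms: float = 500.0) -> bool:
    """Decide, after the trace has completed, whether to store it."""
    # Always keep the traces you will want during an incident.
    if has_error or duration_ms >= slow_ms:
        return True
    # Deterministic baseline: map the trace ID to a number in [0, 1)
    # and keep the trace if it falls under the baseline rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


# Errors and slow traces are always kept, whatever the baseline rate.
print(keep_trace("trace-1", duration_ms=20, has_error=True, baseline_rate=0.0))
print(keep_trace("trace-2", duration_ms=900, has_error=False, baseline_rate=0.0))

# Fast, successful traces are kept at roughly the baseline rate.
kept = sum(keep_trace(f"trace-{i}", 20, False) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
```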
What to instrument
Most tracing libraries auto-instrument HTTP servers and clients. That covers a lot, but depending on the library and how it's configured, it can miss things that matter during incidents:
| What to instrument | Why it matters |
|---|---|
| Database queries | Often the slowest spans in an incident |
| External API calls | Third-party failures are invisible otherwise |
| Message queue publishes and consumes | The trace chain breaks here if context is not propagated |
| Cron jobs and background workers | They cause incidents but are often left untraced |
| Cache hits and misses | High miss rates show up as latency spikes |
The trace context (the trace_id and span_id that link spans together) needs to be propagated through every boundary: HTTP headers, message queue message attributes, async job payloads. If you don't propagate it, the trace chain breaks and you lose visibility the moment a request crosses that boundary.
Message queues are the most commonly missed boundary. When a service publishes a job to a queue and a worker picks it up, those are two separate processes. Without trace context carried in the message (in its attributes or headers where the broker supports them, otherwise in the payload), your trace ends at the queue boundary. With it, you can see the full chain from HTTP request to background job completion.
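As a rough illustration of propagation across a queue, the sketch below carries trace context in a message's headers using the W3C `traceparent` format (`version-traceid-parentid-flags`). The message shape and the `inject` and `extract` helpers are hypothetical, and the parser is simplified: it only accepts version `00` and skips some validity checks the specification requires. In a real system you would normally let your tracing library's propagator do this.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Publisher side: write the current trace context into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Worker side: read trace context back out, or None if absent or malformed."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Publisher: the HTTP handler's span enqueues a job.
trace_id = secrets.token_hex(16)       # 32 hex characters
publish_span_id = secrets.token_hex(8)  # 16 hex characters
message = {"headers": {}, "body": {"order_id": 1234}}
inject(message["headers"], trace_id, publish_span_id)

# Worker: continue the same trace with a new span parented to the publisher's span.
ctx = extract(message["headers"])
worker_span = {
    "trace_id": ctx["trace_id"],
    "span_id": secrets.token_hex(8),
    "parent_id": ctx["parent_span_id"],
}
print(worker_span["trace_id"] == trace_id)
```

If `extract` returns `None`, the worker has no parent context and would have to start a new trace, which is exactly the broken chain described above.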
A few things that actually help
Standardize operation names. If service A names an operation getUserById and service B names the same concept fetchUser, your trace queries become harder to write and reuse. Pick a naming convention and stick to it across services.
Log the trace ID. Include trace_id in your structured logs. This turns your logging backend into a trace-searchable system: given a trace ID from an incident, you can pull all logs from all services that participated in that request, even if your logs are in a different tool than your traces.
Set span attributes for business context. Auto-instrumentation already records technical details such as the HTTP status code. Add the user ID, tenant ID, order ID, whatever makes sense for your system. During incidents involving a specific customer or transaction, this lets you filter traces by business entity, not just technical properties.
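For the "log the trace ID" advice, one way to do it with Python's standard `logging` module is sketched below. It assumes the active trace ID is available in a context variable; the `current_trace_id` variable, the logger name, and the JSON field names are all illustrative choices, and in practice your tracing library would supply the ID.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID, set once per request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including the trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("inventory lookup timed out")
print(stream.getvalue().strip())
```

With every service logging the same `trace_id` field, a single search on that value in your logging backend returns the log lines from every service that handled the request.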
Distributed tracing won't replace metrics or logs. But for the specific problem of "I know a request failed, now where do I look," it's the fastest path to an answer. The investment is in instrumentation and integration upfront. The payoff is incident timelines that shrink from hours to minutes because your on-call engineer can see exactly which span, which query, which downstream call caused the problem.