NearIRM Team

Building a Service Catalog That Actually Helps During Incidents

It's 2:47am. An alert fires. You're looking at an unfamiliar service name in the error trace, and nobody on the bridge call knows who owns it or where the runbook lives. Someone starts pinging Slack channels. Five minutes pass.

Those five minutes don't have to happen. A service catalog fixes that.

A service catalog is a registry of every service your team runs: who owns it, who to page, what it depends on, and where to find the relevant docs. It sounds simple because it is. Most teams skip it anyway, and pay for that in wasted time during incidents.

What belongs in a service catalog entry

Keep entries lean. The goal is fast lookup under pressure, not documentation for its own sake.

Each entry should cover:

Field                    Example
Service name             payment-processor
Description              Handles Stripe webhook ingestion and charge retries
Owning team              Payments team
Primary on-call          Link to rotation in your alerting tool
Tier                     1 (customer-impacting), 2 (internal), 3 (non-critical)
Upstream dependencies    auth-service, postgres-payments
Downstream consumers     order-service, email-notifications
Runbook                  Link to runbook doc
Dashboards               Links to Grafana, Datadog, etc.
Deployment info          Repo link, deploy process, rollback steps

You don't need all of this on day one. Start with name, owner, tier, and runbook link. That alone cuts most of the confusion in a real incident.
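To show how lean an entry can be, here is a minimal sketch of the fields above as a small record type. The class name, field names, and the runbook URL are illustrative assumptions, not a schema from any particular tool. Only the four starter fields are required; everything else has a default so you can fill it in later.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    CUSTOMER_IMPACTING = 1
    INTERNAL = 2
    NON_CRITICAL = 3


@dataclass
class CatalogEntry:
    # The four fields worth having on day one.
    name: str
    owning_team: str
    tier: Tier
    runbook_url: str
    # Optional fields; empty until someone fills them in.
    description: str = ""
    on_call_url: str = ""
    upstream: list[str] = field(default_factory=list)
    downstream: list[str] = field(default_factory=list)
    dashboards: list[str] = field(default_factory=list)
    repo_url: str = ""


entry = CatalogEntry(
    name="payment-processor",
    owning_team="Payments team",
    tier=Tier.CUSTOMER_IMPACTING,
    runbook_url="https://wiki.example.com/runbooks/payment-processor",  # placeholder URL
    upstream=["auth-service", "postgres-payments"],
    downstream=["order-service", "email-notifications"],
)
```

The same shape works as a spreadsheet row or a YAML file next to the service. What matters is that the required fields are few enough that nobody skips them.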

Why teams skip this (and why that's a mistake)

The usual reasoning: "We're a small team, everyone knows who owns what."

That works when you're five people and nothing fails at night. It breaks down when:

  • A new engineer joins and doesn't have the implicit knowledge yet
  • The one person who "knows everything" is on PTO
  • An incident spans multiple services across teams
  • You're six months post-acquisition and nobody is sure which team inherited that legacy service

The other common objection is maintenance. "We'll build it and it'll rot." That's a real risk, but a stale catalog with mostly correct information is still better than nothing. You just need a lightweight process to keep it from drifting too far.

How to build one without it becoming a project

Don't treat this as a big initiative. You won't finish it.

Start with your tier-1 services. List every service that would cause a customer-facing outage if it went down. That's probably 5-15 services for most companies. Catalog those first. Get them complete. Then expand.

Tie it to your alerting tool. If your alerts can link back to a catalog entry, responders get context automatically when they're paged. PagerDuty has service directories. NearIRM lets you attach service metadata directly to alert rules. Most tools have some version of this. Use it.

Assign ownership clearly, not loosely. "The platform team" is not an owner. A specific on-call rotation is an owner. If there's ambiguity about who owns a service, the catalog forces that conversation before an incident does.

Make it part of new service creation. Add a catalog entry to your service launch checklist. A deploy without one should get flagged in code review or by your CI pipeline. This is the highest-leverage habit to build early.
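A CI check for this can be very small. The sketch below assumes you have already loaded the catalog into a dict keyed by service name and have a list of the services being deployed. How you load both depends on your setup and is not shown, and the field names are hypothetical.

```python
REQUIRED_FIELDS = ("name", "owner", "tier", "runbook")  # hypothetical field names


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or blank in one entry."""
    return [f for f in REQUIRED_FIELDS if not str(entry.get(f) or "").strip()]


def check_services(deployed: list[str], catalog: dict[str, dict]) -> list[str]:
    """Return one human-readable problem per service that fails the check."""
    problems = []
    for service in deployed:
        entry = catalog.get(service)
        if entry is None:
            problems.append(f"{service}: no catalog entry")
            continue
        missing = missing_fields(entry)
        if missing:
            problems.append(f"{service}: missing {', '.join(missing)}")
    return problems


# Example data: one complete entry, one incomplete, one service with no entry.
catalog = {
    "payment-processor": {
        "name": "payment-processor",
        "owner": "payments-oncall",
        "tier": 1,
        "runbook": "https://wiki.example.com/runbooks/payment-processor",
    },
    "payments-v2": {"name": "payments-v2", "owner": "", "tier": 1},
}
problems = check_services(["payment-processor", "payments-v2", "new-service"], catalog)
```

In a pipeline you would print each problem and exit non-zero if the list is not empty. Whether that blocks the deploy or just warns is a policy choice; warning first tends to go down easier while you backfill.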

Keeping it current

The catalog drifts when:

  1. Teams reorganize and nobody updates ownership
  2. Services get renamed or deprecated
  3. New dependencies form without anyone noting them

Three habits that help:

Review in postmortems. After any incident where you lost time tracking down an owner or a runbook, update the catalog entry as part of the action items. Postmortems are a natural forcing function.

Ownership reviews once a quarter. A 30-minute team meeting to walk through tier-1 and tier-2 entries and verify owners are still correct. Calendar it. Do it.

Automate what you can. If your on-call tool has an API, you can pull current rotation membership into the catalog automatically. If you use infrastructure-as-code, service metadata can live alongside the service definition and get synced.
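As a rough sketch of the on-call sync, assume you have already pulled current rotation membership from your on-call tool into a mapping of service name to people. That fetch is tool-specific and deliberately left out here. The function name and the on_call field are assumptions for illustration.

```python
def sync_on_call(
    catalog: dict[str, dict], rotations: dict[str, list[str]]
) -> tuple[list[str], list[str]]:
    """Copy rotation members into catalog entries.

    Returns (changed, uncataloged): services whose on-call list was updated,
    and services that have a rotation but no catalog entry.
    """
    changed, uncataloged = [], []
    for service, members in rotations.items():
        entry = catalog.get(service)
        if entry is None:
            uncataloged.append(service)
            continue
        if entry.get("on_call") != members:
            entry["on_call"] = list(members)
            changed.append(service)
    return changed, uncataloged


# Example data with made-up names.
catalog = {
    "payment-processor": {"on_call": ["alice"]},
    "auth-service": {"on_call": ["carol"]},
}
rotations = {
    "payment-processor": ["bob"],
    "auth-service": ["carol"],
    "legacy-billing": ["dave"],
}
changed, uncataloged = sync_on_call(catalog, rotations)
```

The second list is the useful side effect: a rotation that points at a service the catalog has never heard of is drift you want to know about before the next incident.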

Formats and tools

You don't need dedicated software to start. Options, roughly in order of complexity:

A shared spreadsheet. Works for small teams. Easy to edit, no setup. Falls apart when you have 50+ services and want to link from alerts.

A wiki page (Notion, Confluence). A bit more structure, still easy. Good middle ground for teams that already live in a wiki.

Backstage. An open-source developer portal, originally built at Spotify, with a software catalog at its core. Purpose-built for this. More setup work, but gives you a proper UI, plugin ecosystem, and API. Worth it once you're past 20-30 services or need deep integration with other tooling.

Your incident management platform. NearIRM and similar tools have service registry features that connect directly to alert routing and on-call schedules. The advantage here is that the catalog isn't a separate artifact you have to link to. It is the configuration.

Connecting the catalog to your alerts

The highest-value integration is getting catalog data into alert context automatically.

When an alert fires, the responder should see: which service is affected, who owns it, and a link to the runbook. No digging. No Slack messages asking "hey who owns payments-v2?"

Most alerting tools support annotations or metadata fields on alert rules. Use them. Tag each alert with its service name, then pull the rest of the context from the catalog. If your platform supports it, surface the on-call contact directly in the alert notification.
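The tag-then-look-up step might look like the sketch below. The alert shape (a labels dict carrying a service key) and the catalog field names are assumptions; adapt them to whatever your alerting tool actually sends.

```python
def enrich_alert(alert: dict, catalog: dict[str, dict]) -> dict:
    """Return a copy of the alert with owner, runbook, and on-call from the catalog."""
    service = alert.get("labels", {}).get("service")
    entry = catalog.get(service)
    enriched = dict(alert)  # shallow copy; the original alert is left untouched
    if entry is None:
        # Surface the gap instead of failing: the page still has to go out.
        enriched["catalog"] = {"warning": f"no catalog entry for service {service!r}"}
        return enriched
    enriched["catalog"] = {
        "owner": entry.get("owner"),
        "runbook": entry.get("runbook"),
        "on_call": entry.get("on_call"),
    }
    return enriched


# Example data with made-up names and a placeholder URL.
catalog = {
    "payments-v2": {
        "owner": "payments-oncall",
        "runbook": "https://wiki.example.com/runbooks/payments-v2",
        "on_call": ["bob"],
    }
}
alert = {"title": "High error rate", "labels": {"service": "payments-v2"}}
enriched = enrich_alert(alert, catalog)
unknown = enrich_alert({"title": "Disk full", "labels": {"service": "mystery-svc"}}, catalog)
```

Note the fallback for a missing entry. An alert for an uncataloged service should still reach someone, with the gap made visible so it gets fixed in the postmortem.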

This is the point where a catalog stops being a document and starts being operational infrastructure. That's where the real time savings happen.

A realistic benchmark

If you're starting from nothing, a working catalog for your tier-1 services is a one-day project. Get a template, block out time, and fill it in with the people who actually know the answers. Ship it somewhere your team can find it. Then spend the next month wiring it into your alerting.

The goal isn't a perfect catalog. It's making sure that the next time someone is paged at 3am, the answer to "who owns this?" takes two seconds, not five minutes.
