
Building a Service Catalog That Actually Helps During Incidents
It's 2:47am. An alert fires. You're looking at an unfamiliar service name in the error trace, and nobody on the bridge call knows who owns it or where the runbook lives. Someone starts pinging Slack channels. Five minutes pass.
That five minutes doesn't have to happen. A service catalog fixes it.
A service catalog is a registry of every service your team runs: who owns it, who to page, what it depends on, and where to find the relevant docs. It sounds simple because it is. Most teams skip it anyway, and pay for that in wasted time during incidents.
What belongs in a service catalog entry
Keep entries lean. The goal is fast lookup under pressure, not documentation for its own sake.
Each entry should cover:
| Field | Example |
|---|---|
| Service name | payment-processor |
| Description | Handles Stripe webhook ingestion and charge retries |
| Owning team | Payments team |
| Primary on-call | Link to rotation in your alerting tool |
| Tier | 1 (customer-impacting), 2 (internal), 3 (non-critical) |
| Upstream dependencies | auth-service, postgres-payments |
| Downstream consumers | order-service, email-notifications |
| Runbook | Link to runbook doc |
| Dashboards | Links to Grafana, Datadog, etc. |
| Deployment info | Repo link, deploy process, rollback steps |
You don't need all of this on day one. Start with name, owner, tier, and runbook link. That alone cuts most of the confusion in a real incident.
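If the catalog lives in code rather than a spreadsheet, the table above maps onto a small record type. The following is only a sketch of one possible shape: the class and field names are made up for illustration, and the runbook URL is a placeholder. The four day-one fields are required and everything else has a default.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Tier(IntEnum):
    CUSTOMER_IMPACTING = 1
    INTERNAL = 2
    NON_CRITICAL = 3


@dataclass
class CatalogEntry:
    # Day-one fields: enough to answer "who owns this and what do I do?"
    name: str
    owning_team: str
    tier: Tier
    runbook_url: str
    # Everything below can be filled in later.
    description: str = ""
    on_call_url: str = ""
    upstream_dependencies: list[str] = field(default_factory=list)
    downstream_consumers: list[str] = field(default_factory=list)
    dashboard_urls: list[str] = field(default_factory=list)
    repo_url: str = ""


entry = CatalogEntry(
    name="payment-processor",
    owning_team="Payments team",
    tier=Tier.CUSTOMER_IMPACTING,
    # Placeholder URL, not a real runbook location.
    runbook_url="https://wiki.example.com/runbooks/payment-processor",
    upstream_dependencies=["auth-service", "postgres-payments"],
)
```

Making the optional fields default to empty values means a partial entry is still a valid entry, which matches the "start small" advice.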
Why teams skip this (and why that's a mistake)
The usual reasoning: "We're a small team, everyone knows who owns what."
That works fine when you're five people and nothing breaks at night. It stops working when:
- A new engineer joins and doesn't have the implicit knowledge yet
- The one person who "knows everything" is on PTO
- An incident spans multiple services across teams
- You're six months post-acquisition and nobody is sure which team inherited that legacy service
The other common objection is maintenance. "We'll build it and it'll rot." That's a real risk, but a stale catalog with mostly-correct information is still better than nothing. You just need a lightweight process to keep it from drifting too far.
How to build one without it becoming a project
Don't treat this as a big initiative. You won't finish it.
Start with your tier-1 services. List every service that would cause a customer-facing outage if it went down. That's probably 5-15 services for most companies. Catalog those first. Get them complete. Then expand.
Tie it to your alerting tool. If your alerts can link back to a catalog entry, responders get context automatically when they're paged. PagerDuty has a Service Directory. NearIRM lets you attach service metadata directly to alert rules. Most tools have some version of this. Use it.
Assign ownership clearly, not loosely. "The platform team" is not an owner. A specific on-call rotation is an owner. If there's ambiguity about who owns a service, the catalog forces that conversation before an incident does.
Make it part of new service creation. Add a catalog entry to your service launch checklist. A service that ships without one should get flagged in code review or fail a check in your CI pipeline. This is the highest-leverage habit to build early.
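The CI side of that habit can be a very small script. This sketch assumes a layout that is not from any particular tool: one directory per service under a common root, each holding a `catalog.json` file. The file name, the field names, and the function names are all hypothetical.

```python
import json
from pathlib import Path

# Assumed minimum for a usable entry; adjust to your own template.
REQUIRED_FIELDS = ("name", "owning_team", "tier", "runbook_url")


def missing_fields(entry: dict) -> list[str]:
    """Return required fields that are absent or empty in a catalog entry."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f)]


def check_service_dir(service_dir: Path) -> list[str]:
    """Return human-readable problems for one service directory."""
    catalog_file = service_dir / "catalog.json"
    if not catalog_file.is_file():
        return [f"{service_dir.name}: no catalog.json"]
    entry = json.loads(catalog_file.read_text())
    return [f"{service_dir.name}: missing {f}" for f in missing_fields(entry)]


def main(root: str) -> int:
    """Check every service directory under root; return a process exit code."""
    problems = []
    for service_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        problems.extend(check_service_dir(service_dir))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

A CI step could call `sys.exit(main("services"))` so that a missing or incomplete entry fails the build.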
A service catalog entry doesn't need to be complete to be useful. An entry with a name, tier, and owner is already better than no entry at all. Don't wait for perfect.
Keeping it current
The catalog drifts when:
- Teams reorganize and nobody updates ownership
- Services get renamed or deprecated
- New dependencies form without anyone noting them
Three habits that help:
Review in postmortems. After any incident where you lost time finding an owner or hunting for a missing runbook, update the catalog entry as part of the action items. Postmortems are a natural forcing function.
Ownership reviews once a quarter. A 30-minute team meeting to walk through tier-1 and tier-2 entries and verify owners are still correct. Calendar it. Do it.
Automate what you can. If your on-call tool has an API, you can pull current rotation membership into the catalog automatically. If you use infrastructure-as-code, service metadata can live alongside the service definition and get synced.
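As a rough sketch of the rotation sync, the function below refreshes on-call membership for every entry that links a rotation. It deliberately does not call any real vendor API: `fetch_rotation` is a hypothetical callable you would implement against your own on-call tool, and the `rotation_id` and `on_call_members` keys are assumed names.

```python
from typing import Callable


def sync_on_call(
    catalog: dict[str, dict],
    fetch_rotation: Callable[[str], list[str]],
) -> list[str]:
    """Refresh on-call members in place; return names of entries that changed."""
    changed = []
    for name, entry in catalog.items():
        rotation_id = entry.get("rotation_id")
        if not rotation_id:
            continue  # nothing to sync until the entry links a rotation
        members = fetch_rotation(rotation_id)
        if entry.get("on_call_members") != members:
            entry["on_call_members"] = members
            changed.append(name)
    return changed


# Stand-in for a real API client, used here only to show the call shape.
def fake_fetch(rotation_id: str) -> list[str]:
    return {"rot-payments": ["alice", "bob"]}.get(rotation_id, [])


catalog = {
    "payment-processor": {"rotation_id": "rot-payments", "on_call_members": ["alice"]},
    "legacy-reports": {},  # no rotation linked yet, so it is skipped
}
changed = sync_on_call(catalog, fake_fetch)
```

Returning the list of changed entries makes it easy to log or post a summary each time the sync runs.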
Formats and tools
You don't need dedicated software to start. Options, roughly in order of complexity:
A shared spreadsheet. Works for small teams. Easy to edit, no setup. Falls apart when you have 50+ services and want to link from alerts.
A wiki page (Notion, Confluence). A bit more structure, still easy. Good middle ground for teams that already live in a wiki.
Backstage. An open-source developer portal, originally built at Spotify, with a software catalog at its core. More setup work, but gives you a proper UI, plugin ecosystem, and API. Worth it once you're past 20-30 services or need deep integration with other tooling.
Your incident management platform. NearIRM and similar tools have service registry features that connect directly to alert routing and on-call schedules. The advantage here is that the catalog isn't a separate artifact you have to link to. It is the configuration.
Connecting the catalog to your alerts
The highest-value integration is getting catalog data into alert context automatically.
When an alert fires, the responder should see: which service is affected, who owns it, and a link to the runbook. No digging. No Slack messages asking "hey who owns payments-v2?"
Most alerting tools support annotations or metadata fields on alert rules. Use them. Tag each alert with its service name, then pull the rest of the context from the catalog. If your platform supports it, surface the on-call contact directly in the alert notification.
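The "tag, then look up" step might look something like the sketch below. The alert and catalog shapes are assumptions for illustration (a `labels` dict carrying a `service` key, and catalog entries keyed by service name); real alert payloads will differ by tool.

```python
def enrich_alert(alert: dict, catalog: dict[str, dict]) -> dict:
    """Return a copy of the alert with catalog context attached under "context"."""
    service = alert.get("labels", {}).get("service")
    entry = catalog.get(service)
    if entry is None:
        # Unknown service: say so explicitly rather than paging with no context.
        context = {"service": service, "owner": "unknown", "runbook": None}
    else:
        context = {
            "service": service,
            "owner": entry["owning_team"],
            "tier": entry["tier"],
            "runbook": entry["runbook_url"],
        }
    return {**alert, "context": context}


catalog = {
    "payment-processor": {
        "owning_team": "Payments team",
        "tier": 1,
        # Placeholder URL, not a real runbook location.
        "runbook_url": "https://wiki.example.com/runbooks/payment-processor",
    }
}
alert = {"title": "High error rate", "labels": {"service": "payment-processor"}}
page = enrich_alert(alert, catalog)
```

Marking uncatalogued services as `"unknown"` instead of dropping the context turns every such page into a visible prompt to add the missing entry.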
This is the point where a catalog stops being a document and starts being operational infrastructure. That's where the real time savings happen.
A realistic benchmark
If you're starting from nothing, a working catalog for your tier-1 services is a one-day project. Get a template, block out time, and fill it in with the people who actually know the answers. Ship it somewhere your team can find it. Then spend the next month wiring it into your alerting.
The goal isn't a perfect catalog. It's making sure that the next time someone is paged at 3am, the answer to "who owns this?" takes two seconds, not five minutes.