NearIRM Team

On-Call Onboarding: What Engineers Need Before Their First Shift

Most teams onboard engineers to on-call the same way: add them to the rotation and wish them luck. If they're lucky, a senior engineer says "ping me if you get stuck." That's not onboarding; that's just hoping nothing breaks on someone's first shift.

The result is predictable. New on-call engineers take longer to triage, escalate more than they need to, and carry extra stress because they don't know what they don't know. Some burn out fast. Others quietly dread every rotation.

A better onboarding process doesn't have to be elaborate. It just needs to be deliberate.

The two things new on-call engineers are missing

Before someone can handle incidents confidently, they need two things that general engineering experience can't substitute for:

Context about what "normal" looks like. Without a baseline, every alert feels ambiguous. Is this CPU spike a problem or just the daily batch job? Does this error rate warrant waking someone up? Experienced engineers answer these questions from memory. New engineers can't.

Knowing who and what to reach for. When something goes wrong at 2am, hesitation is expensive. Engineers need to know which runbooks exist, which dashboards matter, how to escalate, and how to find teammates who can help.

Getting both of these things in place before someone's first solo shift is the whole job.

Before the first shadow shift

The week before someone joins the rotation, give them three specific tasks:

Read the last five postmortems. Not to memorize them, but to understand the shape of real incidents at your org. What broke? How was it found? What slowed down the response? Postmortems are the fastest way to build mental models about your system's failure modes.

Walk through the runbooks. Not just "we have runbooks." Have them open each one and check: Is it current? Does it have the right links? Are the steps clear? This accomplishes two things at once: the engineer learns what's there, and you surface gaps before they matter.

Spend 30 minutes with someone who's been on-call for six months. Have them walk through a recent alert from triage to resolution. Not a lecture, just a live narration. "Here's what I saw, here's what I checked first, here's how I decided to escalate."

The shadow shift

The shadow pattern varies by team, but one or two shadow shifts before going solo is usually enough. The goal isn't observation; it's guided practice.

When an alert comes in during a shadow shift, flip the dynamic. Have the experienced engineer watch while the new person drives. The experienced engineer can ask questions and offer hints, but shouldn't take the wheel unless something is genuinely critical.

This matters because passive observation builds almost no muscle memory. Driving, even with a co-pilot, builds a lot.

Track a few things during shadow shifts:

  • Did the engineer know where to look first?
  • Could they articulate what they were trying to rule out?
  • Did they know when to escalate vs. keep investigating?
  • Did they know how to communicate status during the incident?

If the answer to any of these is consistently "no," that's a signal to address before they go solo, not after.

A readiness checklist

Before someone's first solo shift, run through this checklist together:

| Area | Check |
| --- | --- |
| Alerts | Can you explain what each active alert monitors and what "firing" means? |
| Dashboards | Do you know which dashboards to open first for your top three service types? |
| Runbooks | Have you confirmed the runbooks for your services are current and complete? |
| Escalation | Can you name who to call if you're stuck, and how to reach them at 3am? |
| Tooling | Can you acknowledge, reassign, and resolve incidents in your alerting platform? |
| Communication | Do you know where to post status updates during an active incident? |
| Boundaries | Do you know when to declare a P1 vs. keep it contained as a P2? |

Go through it literally. Not "do you understand escalation?" but "name the person you'd call if the database was down and you couldn't figure out why." Gaps become obvious fast.
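One way to keep the walkthrough literal is to record the actual answer for each area instead of a yes/no. The sketch below is only an illustration: the `CheckItem` type, the `open_gaps` helper, and the sample prompts and answers are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class CheckItem:
    area: str
    prompt: str
    answer: str = ""  # the concrete answer given; empty means "couldn't answer"


def open_gaps(items):
    """Return the areas where no concrete answer was recorded."""
    return [item.area for item in items if not item.answer.strip()]


# Made-up walkthrough results, for illustration only.
checklist = [
    CheckItem(
        "Escalation",
        "Name the person you'd call if the database was down "
        "and you couldn't figure out why.",
        answer="Primary DBA on the secondary rotation, via phone",
    ),
    CheckItem(
        "Communication",
        "Where do you post status updates during an active incident?",
    ),
    CheckItem(
        "Boundaries",
        "Describe a situation you would declare as a P1 rather than a P2.",
        answer="   ",  # whitespace only: treated as unanswered
    ),
]

print(open_gaps(checklist))
```

Anything that shows up in the list is a gap to close before the first solo shift.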

The first solo shift

Even with good preparation, the first solo shift is nerve-wracking. Two things help:

Set explicit expectations. Tell them: escalate earlier than you think you need to. It's not a sign of weakness to call for help; it's good incident practice. Senior engineers would rather get a call that turns out to be unnecessary than hear about a P1 that sat unresolved for 30 minutes because someone didn't want to bother them.

Do a debrief after, regardless of what happened. If there were incidents, walk through what went well and what felt uncertain. If the shift was quiet, the debrief is still useful: ask what they would have done if an alert for a specific service had fired, and see how they reason through it.

The debrief takes 20 minutes, and it's one of the highest-leverage things you can do for retention. People who feel supported during on-call are much less likely to burn out.

Measuring whether onboarding is working

If your onboarding process is working, you should see it in the data over time:

  • Escalation rate for new engineers should drop after the second or third rotation
  • Time to acknowledge should be low from the start (this is mostly about tooling access)
  • Time to resolve should approach team average within two or three months

If escalation rates stay high, the runbooks or dashboards probably need work. If resolution time is stuck, the engineer might need more context about the system architecture, not just the alerting setup.
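As a rough sketch of how these numbers could be pulled from exported incident data: the `Incident` fields, the function names, and the sample records below are all assumptions made for illustration, so adapt them to whatever your alerting platform actually exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    responder: str
    rotation: int  # 1 = the responder's first rotation, 2 = second, ...
    minutes_to_ack: float
    minutes_to_resolve: float
    escalated: bool


def escalation_rate_by_rotation(incidents, responder):
    """Fraction of one responder's incidents that were escalated, per rotation."""
    totals = defaultdict(int)
    escalations = defaultdict(int)
    for inc in incidents:
        if inc.responder == responder:
            totals[inc.rotation] += 1
            escalations[inc.rotation] += int(inc.escalated)
    return {rot: escalations[rot] / totals[rot] for rot in sorted(totals)}


def resolve_time_ratio(incidents, responder):
    """Responder's mean time to resolve divided by the mean for everyone else.

    Returns None when either side has no incidents to compare.
    """
    own = [i.minutes_to_resolve for i in incidents if i.responder == responder]
    others = [i.minutes_to_resolve for i in incidents if i.responder != responder]
    if not own or not others:
        return None
    return mean(own) / mean(others)


# Made-up sample data, for illustration only.
incidents = [
    Incident("new", 1, 3, 90, True),
    Incident("new", 1, 2, 60, True),
    Incident("new", 2, 2, 50, True),
    Incident("new", 2, 4, 40, False),
    Incident("new", 3, 1, 30, False),
    Incident("vet", 1, 2, 30, False),
    Incident("vet", 1, 3, 20, False),
]

print(escalation_rate_by_rotation(incidents, "new"))  # {1: 1.0, 2: 0.5, 3: 0.0}
print(resolve_time_ratio(incidents, "new"))
```

In this sample the escalation rate falls with each rotation, while the resolve-time ratio is still well above 1. By the reasoning above, that pattern points toward missing system context instead of runbook gaps.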

Ask engineers directly after their first month: "What do you wish you'd known before your first shift?" The answers are usually specific and actionable. Do something with them.

The real payoff

Better on-call onboarding reduces stress for new engineers, but it also makes incidents go better for everyone. When responders know the system and know the process, triage is faster, escalation is cleaner, and postmortems contain fewer "I didn't know that was a thing" moments.

The teams that do this well treat on-call as a skill that can be taught, not a gauntlet that gets easier with time. That's not a soft culture thing; it's just more effective.
