
On-Call Onboarding: What Engineers Need Before Their First Shift
Most teams onboard engineers to on-call the same way: add them to the rotation and wish them luck. If they're lucky, a senior engineer says "ping me if you get stuck." That's not onboarding; that's just hoping nothing breaks on someone's first shift.
The result is predictable. New on-call engineers take longer to triage, escalate more than they need to, and carry extra stress because they don't know what they don't know. Some burn out fast. Others quietly dread every rotation.
A better onboarding process doesn't have to be elaborate. It just needs to be deliberate.
The two things new on-call engineers are missing
Before someone can handle incidents confidently, they need two things that engineering experience gained elsewhere can't substitute for:
Context about what "normal" looks like. Without a baseline, every alert feels ambiguous. Is this CPU spike a problem or just the daily batch job? Does this error rate warrant waking someone up? Experienced engineers answer these questions from memory. New engineers can't.
Knowing who and what to reach for. When something goes wrong at 2am, hesitation is expensive. Engineers need to know which runbooks exist, which dashboards matter, how to escalate, and how to find teammates who can help.
Getting both of these things in place before someone's first solo shift is the whole job.
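One way to make "normal" concrete for a new engineer is to write the baseline down instead of leaving it in senior engineers' heads. The sketch below is a minimal illustration, not a recommendation for any particular monitoring tool. It assumes you can export historical readings as (hour of day, value) pairs. The function names, the sample history, and the 1.5x tolerance are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def hourly_baseline(samples):
    """Map hour of day (0-23) to the median historical reading for that hour.

    `samples` is an iterable of (hour, value) pairs, e.g. CPU percent.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: median(values) for hour, values in by_hour.items()}


def is_unusual(hour, value, baseline, tolerance=1.5):
    """True if `value` exceeds the usual reading for this hour by `tolerance`x.

    Hours with no history count as unusual so that a human takes a look.
    """
    usual = baseline.get(hour)
    return usual is None or value > usual * tolerance


# Hypothetical history: a nightly batch job pushes CPU to ~85% around 02:00.
history = [(2, 84), (2, 86), (2, 85), (14, 30), (14, 32), (14, 28)]
baseline = hourly_baseline(history)

print(is_unusual(2, 88, baseline))   # 88% at 02:00: compared with the batch-job norm
print(is_unusual(14, 88, baseline))  # 88% at 14:00: compared with the quiet afternoon norm
```

Even a rough table like `baseline` gives a new engineer something to check an alert against, which is the question experienced engineers answer from memory.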
Before the first shadow shift
The week before someone joins the rotation, give them three specific tasks:
Read the last five postmortems. Not to memorize them, but to understand the shape of real incidents at your org. What broke? How was it found? What slowed down the response? Postmortems are the fastest way to build mental models about your system's failure modes.
Walk through the runbooks. Not just "we have runbooks." Have them open each one and check: Is it current? Does it have the right links? Are the steps clear? This accomplishes two things at once: the engineer learns what's there, and you surface gaps before they matter.
Spend 30 minutes with someone who's been on-call for six months. Have them walk through a recent alert from triage to resolution. Not a lecture, just a live narration. "Here's what I saw, here's what I checked first, here's how I decided to escalate."
The shadow shift
The shadow pattern varies by team, but one or two shadow shifts before going solo is usually enough. The goal isn't observation; it's guided practice.
When an alert fires during a shadow shift, flip the dynamic. Have the experienced engineer watch while the new person drives. The experienced engineer can ask questions and offer hints, but shouldn't take the wheel unless something is genuinely critical.
This matters because passive observation builds almost no muscle memory. Driving, even with a co-pilot, builds a lot.
Track a few things during shadow shifts:
- Did the engineer know where to look first?
- Could they articulate what they were trying to rule out?
- Did they know when to escalate vs. keep investigating?
- Did they know how to communicate status during the incident?
If the answer to any of these is consistently "no," that's a gap to address before they go solo, not after.
A readiness checklist
Before someone's first solo shift, run through this checklist together:
| Area | Check |
|---|---|
| Alerts | Can you explain what each active alert monitors and what "firing" means? |
| Dashboards | Do you know which dashboards to open first for your top three service types? |
| Runbooks | Have you confirmed the runbooks for your services are current and complete? |
| Escalation | Can you name who to call if you're stuck, and how to reach them at 3am? |
| Tooling | Can you acknowledge, reassign, and resolve incidents in your alerting platform? |
| Communication | Do you know where to post status updates during an active incident? |
| Boundaries | Do you know when to declare a P1 vs. keep it contained as a P2? |
Go through it literally. Not "do you understand escalation?" but "name the person you'd call if the database was down and you couldn't figure out why." Gaps become obvious fast.
The checklist isn't a gate. If someone isn't ready, the answer is more ramp time and support, not blocking them indefinitely. The goal is to surface what's missing while there's still time to address it.
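If you want to keep a record of the walkthrough, the checklist can be captured as plain data. The sketch below is one possible shape, assuming you note the engineer's literal answer and whether they gave it confidently. The `CheckItem` class, its fields, and the sample answers are hypothetical. In keeping with the point above, it reports gaps and does not return a pass or fail verdict.

```python
from dataclasses import dataclass


@dataclass
class CheckItem:
    area: str                # matches the "Area" column of the checklist
    question: str
    answer: str = ""         # the engineer's literal answer, e.g. a name
    confident: bool = False  # did they answer without hesitating?


def find_gaps(items):
    """Return the areas where the answer was missing or given with doubt."""
    return [
        item.area
        for item in items
        if not item.answer.strip() or not item.confident
    ]


# Hypothetical walkthrough results for three of the checklist rows.
walkthrough = [
    CheckItem("Escalation", "Who do you call if the database is down?",
              answer="Database on-call, via the paging app", confident=True),
    CheckItem("Communication", "Where do status updates go?",
              answer="", confident=False),
    CheckItem("Boundaries", "When is it a P1 vs. a P2?",
              answer="Customer-facing outage is a P1", confident=False),
]

print(find_gaps(walkthrough))
```

The output is a to-do list for the remaining ramp time: each listed area is something to revisit together before the first solo shift.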
The first solo shift
Even with good preparation, the first solo shift is nerve-wracking. Two things help:
Set explicit expectations. Tell them: escalate earlier than you think you need to. It's not a sign of weakness to call for help; it's good incident practice. Senior engineers would rather get a call that turns out to be unnecessary than hear about a P1 that sat unresolved for 30 minutes because someone didn't want to bother them.
Do a debrief after, regardless of what happened. If there were incidents: walk through what went well and what felt uncertain. If the shift was quiet: that's still useful. Ask what they would have done if a specific service had fired, and see how they reason through it.
The debrief takes about 20 minutes, and it's one of the highest-leverage things you can do for retention. People who feel supported during on-call are much less likely to burn out.
Measuring whether onboarding is working
If your onboarding process is working, you should see it in the data over time:
- Escalation rate for new engineers should drop after the second or third rotation
- Time to acknowledge should be low from the start (this is mostly about tooling access)
- Time to resolve should approach the team average within two or three months
If escalation rates stay high, the runbooks or dashboards probably need work. If resolution time is stuck, the engineer might need more context about the system architecture, not just the alerting setup.
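The three measures above can be computed from exported incident records. The sketch below is a minimal version under stated assumptions: each record carries the responder, which of their rotations it occurred in, three timestamps, and an escalation flag. The `Incident` fields, the helper names, and the sample data are hypothetical, so map them onto whatever your alerting platform actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    responder: str
    rotation: int  # 1 = the responder's first rotation, 2 = second, ...
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    escalated: bool


def escalation_rate_by_rotation(incidents, responder):
    """Fraction of one responder's incidents that were escalated, per rotation."""
    totals, escalated = {}, {}
    for inc in incidents:
        if inc.responder != responder:
            continue
        totals[inc.rotation] = totals.get(inc.rotation, 0) + 1
        escalated[inc.rotation] = escalated.get(inc.rotation, 0) + int(inc.escalated)
    return {rot: escalated[rot] / totals[rot] for rot in sorted(totals)}


def median_minutes(incidents, start_field, end_field):
    """Median gap in minutes between two timestamp fields; None if no incidents."""
    gaps = [
        (getattr(inc, end_field) - getattr(inc, start_field)).total_seconds() / 60
        for inc in incidents
    ]
    return median(gaps) if gaps else None


# Hypothetical sample data: every incident triggers at the same instant for brevity.
t0 = datetime(2024, 1, 1, 2, 0)


def make(responder, rotation, ack_min, resolve_min, escalated):
    return Incident(responder, rotation, t0,
                    t0 + timedelta(minutes=ack_min),
                    t0 + timedelta(minutes=resolve_min), escalated)


incidents = [
    make("new_hire", 1, 3, 90, True),
    make("new_hire", 1, 2, 70, True),
    make("new_hire", 2, 2, 60, True),
    make("new_hire", 2, 4, 50, False),
    make("new_hire", 3, 2, 40, False),
    make("veteran", 9, 2, 30, False),
    make("veteran", 9, 3, 35, True),
]
new_hire = [inc for inc in incidents if inc.responder == "new_hire"]

rates = escalation_rate_by_rotation(incidents, "new_hire")
ack = median_minutes(new_hire, "triggered_at", "acknowledged_at")
resolve_new = median_minutes(new_hire, "triggered_at", "resolved_at")
resolve_team = median_minutes(incidents, "triggered_at", "resolved_at")
print(rates, ack, resolve_new, resolve_team)
```

Medians are used here instead of means so that one long incident doesn't dominate a small sample. With only a handful of incidents per rotation, treat the numbers as a prompt for a conversation and not as a score.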
Ask engineers directly after their first month: "What do you wish you'd known before your first shift?" The answers are usually specific and actionable. Do something with them.
The real payoff
Better on-call onboarding reduces stress for new engineers, but it also makes incidents go better for everyone. When responders know the system and know the process, triage is faster, escalation is cleaner, and postmortems contain fewer "I didn't know that was a thing" moments.
The teams that do this well treat on-call as a skill that can be taught, not a gauntlet that gets easier with time. That's not a soft culture thing; it's just more effective.