
On-Call Scheduling Patterns That Actually Work
Most teams pick their on-call rotation by accident. Someone sets up a round-robin in the tool, adds the team, and moves on. Six months later people are burning out, incidents are getting picked up slowly at 2am, and no one can remember why the schedule is the way it is.
Rotation design matters more than people realize. The wrong pattern for your team's size, geography, and service complexity creates drag on every incident. Getting it right is mostly about knowing what options exist and what each one trades off.
Here are the main patterns, when they work, and when they don't.
Round-Robin
The simplest rotation. Each person takes a week (or a day, depending on volume), then the schedule cycles to the next.
When it works: Small teams (3-8 people) with relatively uniform service ownership. Everyone knows the stack roughly equally, incidents are infrequent enough that being on-call isn't brutal, and fairness is easy to maintain.
When it breaks down: As teams grow and services specialize, round-robin starts creating mismatches. The person on-call for a Kafka incident might be the engineer who's spent the last year working on the frontend. They'll either struggle through it or escalate immediately, which defeats the point of being on-call at all.
Round-robin also ignores timezones. If your team spans continents and the rotation cycles globally, someone is always getting paged at 3am.
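The cycling logic is trivial, which is part of the appeal. A minimal sketch in Python (team names and dates are placeholders):

```python
from datetime import date, timedelta

def round_robin(engineers, start, weeks):
    """Assign one engineer per week, cycling through the roster in order."""
    schedule = []
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        schedule.append((week_start, engineers[i % len(engineers)]))
    return schedule

# Example: a 4-person team rotating weekly
team = ["ana", "ben", "chloe", "dev"]
for week_start, engineer in round_robin(team, date(2024, 1, 6), 6):
    print(week_start, engineer)
```

Every scheduling tool implements some version of this; the point is that the pattern itself encodes no knowledge of who is qualified for what, which is exactly where it breaks down.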
Primary/Secondary (Shadow) Rotation
Two people share each shift: a primary who owns the response, and a secondary who backs them up if the primary misses an alert or needs help.
The secondary role is also a good place to put engineers who are learning the stack. They shadow real incidents without carrying the full responsibility, and the primary doesn't feel alone during complex outages.
When it works: Teams where incidents frequently benefit from a second pair of hands. Systems with complex interdependencies, where triage alone can take 20 minutes. Teams actively trying to spread on-call knowledge to newer engineers.
When it breaks down: The secondary slot doubles your on-call staffing cost. On a 6-person team running separate primary and secondary rotations, each person is primary every six weeks and secondary every six weeks, which means being on-call in one role or the other every three weeks. That can feel like a lot if incidents are frequent.
Also, the secondary role only works if the secondary is actually engaged. If they're just a fallback that never gets used, it trains people to treat the role as unimportant.
Squad-Based (Service Ownership) Rotation
Each team owns the on-call rotation for the services they built and maintain. The payments team is on-call for the payments service. The platform team covers infrastructure. And so on.
This is the model many companies move toward as they grow. The logic is straightforward: the people who wrote the code know it best, and ownership creates accountability for reliability.
When it works: Organizations with clear service boundaries and teams large enough to sustain their own rotations (generally 4+ people per squad). Works well alongside SLO-based alerting, where each team owns their own error budget.
When it breaks down: Small teams can't sustain their own rotations without burning out. A 2-person microservices team covering a 24/7 rotation is unsustainable. You'll need to pool teams or share rotations across squads with related services.
Cross-team incidents also get complicated. When a payment failure turns out to be a database issue that turns out to be a network problem, squad-based rotations require good escalation paths and communication across teams.
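At minimum, the routing layer needs an explicit service-to-squad ownership map with a fallback, so an alert for an unowned service still pages someone. A hypothetical sketch (service and squad names are made up):

```python
# Hypothetical ownership map; in practice this often lives in a service catalog.
SERVICE_OWNERS = {
    "payments-api": "payments",
    "checkout": "payments",
    "postgres-cluster": "platform",
    "edge-proxy": "platform",
}

def page_target(service, default_squad="platform"):
    """Route an alert to the squad that owns the service.
    Unowned services fall back to a default squad rather than dropping the page."""
    return SERVICE_OWNERS.get(service, default_squad)
```

The fallback matters: the failure mode for squad-based rotations is usually an alert that pages no one because ownership was never recorded.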
Follow-the-Sun
For organizations with engineers spread across multiple timezones, follow-the-sun divides the 24-hour day into shifts handled by different regional teams. The US team covers US business hours, EMEA covers European hours, APAC covers Asia-Pacific hours. No one gets paged outside their working day.
When it works: Companies with genuine engineering presence in multiple regions and sufficient headcount in each. When it's working, follow-the-sun means no one takes overnight pages. Incidents get handed off between regions at the end of each shift.
When it breaks down: It requires enough engineers in each timezone to form a real rotation, not just a couple of people. And handoffs between regions need to be disciplined. If context doesn't transfer clearly at shift boundaries, the incoming team wastes 20 minutes figuring out what the outgoing team already knew.
A common failure mode: a company implements follow-the-sun, but one region is understaffed, so that region's engineers are still covering odd hours to fill gaps. The rotation looks good on paper but doesn't deliver on its promise.
Handoff quality makes or breaks follow-the-sun. Build a structured handoff template: current incidents and their status, anything that was investigated but not resolved, and any planned changes that could affect reliability. Five minutes of good notes saves an hour of context-gathering.
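The template above is easy to make structural rather than freeform. A minimal sketch of one way to encode it (field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Structured shift handoff, mirroring the template above."""
    outgoing: str
    incoming: str
    open_incidents: list = field(default_factory=list)        # (id, status) pairs
    investigated_unresolved: list = field(default_factory=list)
    upcoming_changes: list = field(default_factory=list)

    def summary(self):
        lines = [f"Handoff: {self.outgoing} -> {self.incoming}"]
        lines += [f"  open: {inc} ({status})" for inc, status in self.open_incidents]
        lines += [f"  investigated, unresolved: {note}" for note in self.investigated_unresolved]
        lines += [f"  upcoming change: {change}" for change in self.upcoming_changes]
        return "\n".join(lines)
```

Even if this just gets pasted into a chat channel at shift boundaries, forcing the three categories keeps the outgoing engineer from handing off "nothing happening" when an investigation is actually half-finished.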
Tiered Escalation
Some teams split on-call into tiers. A first tier (L1) handles initial triage and low-complexity incidents. A second tier (L2) consists of domain experts who get paged only if L1 can't resolve.
This works well for companies with high alert volume and a mix of incident complexity. L1 engineers handle the routine stuff. L2 engineers only get pulled in for the genuinely hard problems.
When it works: Larger organizations with dedicated SRE or platform teams who can staff L1, and specialist engineers who would otherwise spend a lot of on-call time on incidents that don't need their expertise.
When it breaks down: If your L1 team isn't empowered to resolve incidents, they become a routing layer that just escalates everything. That adds latency without adding value. L1 needs runbooks, appropriate access, and the authority to make decisions, not just gather information and hand it off.
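The escalation logic itself is usually timeout-driven: L2 is paged only if L1 doesn't acknowledge within a window. A sketch of that rule; the 15-minute timeout is an illustrative default, not a recommendation:

```python
def tiers_paged(l1_ack_minutes, timeout_minutes=15):
    """Return which tiers get paged for one alert.
    l1_ack_minutes is minutes until L1 acknowledged, or None if they never did.
    L2 is paged only when L1 misses the acknowledgment timeout."""
    if l1_ack_minutes is not None and l1_ack_minutes <= timeout_minutes:
        return ["L1"]
    return ["L1", "L2"]
```

Real paging tools also support explicit manual escalation, which matters here: an empowered L1 escalates deliberately with context, rather than letting the timeout fire and dumping a cold alert on L2.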
Choosing the Right Pattern
A few questions help narrow things down:
How large is the team? Under 5 engineers, primary/secondary will strain people; round-robin is usually more sustainable. Over 10, you have more options, and squad-based rotations start to make sense.
How geographically distributed are you? Distributed teams with significant timezone gaps should at minimum consider timezone-aware rotations, even if full follow-the-sun isn't feasible yet.
How specialized are your services? High specialization argues for squad-based. Generalist teams with broad ownership work better with shared rotations.
What's your incident volume? High-volume teams need more structure: clear escalation paths, tiered response, runbooks that reduce cognitive load. Low-volume teams can get by with simpler setups.
Most mature organizations end up with a hybrid. Squad-based primary rotations, with a secondary across squads for backup. Or follow-the-sun at the top level, with round-robin within each regional team.
A Few Practical Notes
Rotation length: Weekly rotations are standard. Some teams prefer shorter (3-4 days) for high-volume services, to reduce cumulative fatigue. Longer than a week usually means people start dreading their turn.
Handoff timing: Schedule handoffs mid-week, not Monday morning. Monday handoffs mean the incoming engineer hits a backlog of weekend noise. Wednesday handoffs give the incoming engineer a quieter start.
On-call windows: Some teams run 8-hour on-call windows instead of 24-hour. This makes sense for follow-the-sun, but in practice often means engineers still check their phones outside their window. Be honest about whether your team actually disconnects outside designated hours.
Rotation fairness: Track who takes incidents and when. Equal rotation slots don't mean equal burden if some people always get the 3am pages. Audit your paging data periodically and adjust.
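The audit can be a few lines over your paging tool's export. A sketch that counts overnight pages per engineer; the 22:00-07:00 window is an illustrative definition of "off hours", and the record format is assumed:

```python
from collections import Counter
from datetime import datetime

def off_hours_page_counts(pages, start_hour=22, end_hour=7):
    """Count overnight pages per engineer from (engineer, timestamp) records.
    A page counts as off-hours if it lands between start_hour and end_hour."""
    counts = Counter()
    for engineer, ts in pages:
        if ts.hour >= start_hour or ts.hour < end_hour:
            counts[engineer] += 1
    return counts

pages = [
    ("ana", datetime(2024, 3, 1, 3, 12)),   # 03:12, off-hours
    ("ben", datetime(2024, 3, 1, 14, 5)),   # 14:05, business hours
    ("ana", datetime(2024, 3, 2, 23, 40)),  # 23:40, off-hours
]
print(off_hours_page_counts(pages))
```

If one name dominates the count month after month, the rotation is fair on paper and unfair in practice, and the schedule (or the alerting) needs to change.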
The rotation pattern is only one piece. Good tooling, useful runbooks, sensible alert thresholds, and a culture where engineers feel supported all matter just as much. But getting the scheduling structure right is the foundation everything else builds on.