On-call is one of those things every infrastructure company has to get right, yet few engineers actually enjoy it. When I joined Halcyon, on-call was a source of dread. Engineers would swap shifts, ignore alerts, and burn out. We decided to rebuild our on-call program from first principles.
Eighteen months later, our on-call satisfaction score is 4.2/5 (up from 2.1), and our mean time to resolution has dropped by 60%. Here's what we changed.
The Principles
We wrote down five principles that guide every on-call decision we make:
- If it pages, it matters. We ruthlessly pruned our alert rules from 340 down to 62. Every remaining alert has a clear runbook, a defined severity, and a human-verified threshold. If an alert fires and the on-call engineer can't take meaningful action, the alert gets deleted or rewritten.
- On-call is compensated, always. Engineers on the primary rotation receive additional compensation, regardless of whether they get paged. Being available has a cost, and we should pay for it.
- Rotations are short. We do 3-day rotations instead of week-long ones. This keeps the cognitive load manageable and means no one is on-call for an entire weekend.
- Escalation is not failure. We actively encourage escalation. If the primary on-call can't resolve an issue within 15 minutes, they should escalate -- no questions asked, no judgment. The goal is resolution, not heroics.
- Every incident gets a postmortem. And every postmortem is blameless. We focus on systems and processes, not individuals. Our postmortem template asks "what can we change so this can't happen again?" not "who made the mistake?"
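The first principle is the easiest to enforce mechanically. As a sketch of how "if it pages, it matters" could be checked in CI, here is a small linter that rejects any alert definition lacking a runbook, a recognized severity, or a threshold. The field names (`runbook_url`, `severity`, `threshold`) and the severity set are illustrative assumptions, not our actual schema:

```python
# Sketch of an alert-rule linter enforcing "if it pages, it matters":
# every alert must name a runbook, a defined severity, and a threshold.
# Field names and ALLOWED_SEVERITIES are illustrative assumptions.

ALLOWED_SEVERITIES = {"critical", "warning"}

def lint_alert(alert: dict) -> list[str]:
    """Return a list of problems; an empty list means the alert passes."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append("missing runbook_url")
    if alert.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    if "threshold" not in alert:
        problems.append("missing threshold")
    return problems

def lint_all(alerts: list[dict]) -> dict[str, list[str]]:
    """Map alert name -> problems, covering only the alerts that fail."""
    failures = {}
    for alert in alerts:
        problems = lint_alert(alert)
        if problems:
            failures[alert.get("name", "<unnamed>")] = problems
    return failures
```

Running a check like this on every change to the alert rules keeps the "delete or rewrite" discipline from eroding over time.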
Blameless Postmortems in Practice
Blameless doesn't mean accountless. We still identify contributing factors and assign action items. But the language matters. We say "the deploy pipeline didn't catch the regression" instead of "the engineer didn't test properly." We focus on the guardrails that should have existed, not the human who walked past them.
Every postmortem produces at least one concrete improvement: a new alert, a runbook update, a circuit breaker, or a process change. If a postmortem doesn't produce an action item, we consider it incomplete.
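The "incomplete without an action item" rule can also be expressed as a check rather than a convention. A minimal sketch, assuming postmortems are plain dicts with an `action_items` list (the shape and category names are hypothetical, not our actual template):

```python
# Sketch of the completeness rule: a postmortem counts as complete only
# if it produces at least one concrete improvement. The dict shape and
# category names below are illustrative assumptions.

VALID_CATEGORIES = {"new_alert", "runbook_update", "circuit_breaker", "process_change"}

def is_complete(postmortem: dict) -> bool:
    """True only if the postmortem carries >= 1 recognized action item."""
    items = postmortem.get("action_items", [])
    return any(item.get("category") in VALID_CATEGORIES for item in items)
```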
What We'd Do Differently
We should have invested in on-call tooling earlier. For the first year, we cobbled together PagerDuty, Slack, and Google Docs. Building a unified incident management flow -- page, acknowledge, communicate, resolve, postmortem -- into a single tool cut our mean time to resolution by 30% on its own.
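The unified flow described above can be sketched as a small state machine with one legal path through the lifecycle. The state names mirror the stages in the text; the class and method names are illustrative, not our actual tool's API:

```python
# Sketch of the incident lifecycle as a state machine:
# paged -> acknowledged -> communicating -> resolved -> postmortem.
# Names are illustrative assumptions, not a real tool's API.

TRANSITIONS = {
    "paged": {"acknowledged"},
    "acknowledged": {"communicating"},
    "communicating": {"resolved"},
    "resolved": {"postmortem"},
    "postmortem": set(),  # terminal state
}

class Incident:
    def __init__(self, title: str):
        self.title = title
        self.state = "paged"
        self.history = ["paged"]

    def advance(self, new_state: str) -> None:
        """Move to the next stage; refuse any skip or backward jump."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)
```

Encoding the path this way is what makes a single tool valuable: an incident cannot be resolved without ever being acknowledged, and it cannot be closed out without reaching the postmortem stage.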
On-call doesn't have to be something your team dreads. With clear principles, proper compensation, and a culture that treats incidents as learning opportunities, it can become a genuine point of pride.