Company
Date Published
Author
Laura de Vesine, David Lentz
Word count
1749
Language
English
Hacker News points
1

Summary

Datadog teams create on-call rotations to ensure continuous uptime for their critical services. Team size largely determines how a rotation is structured, because the rotation has to balance service coverage against a sustainable workload for responders, and it affects the experience of everyone in the rotation. Small teams often use 24/7 rotations with brief turns in the on-call role and fewer teammates available to alternate. Shift length varies but is generally between 8 and 12 hours, a range meant to maximize effectiveness while minimizing fatigue. As much as possible, engineers should do only on-call work during their on-call days, keeping feature work separate from on-call responsibilities. Responders receive comprehensive support before, during, and after their on-call duties, including training, resources, and secondary responders as backup. Managers participate in their teams' rotations to understand the procedures and improve the experience. Datadog also provides tools and platforms to support effective on-call practice, such as On-Call, which integrates monitoring, paging, and incident response into a single platform.