Company
Date Published
Author
Laura de Vesine
Word count
3818
Language
English
Hacker News points
None

Summary

The incident response process at Datadog involved multiple teams and a large number of engineers working together to resolve the global outage. The team used a "you build it, you own it" model, with all systems instrumented to provide telemetry data to teams. They had a rotation of senior engineers who were on call for high-severity incidents, and a system in place for rapid escalation to executives and customer support. The response was scaled by design, with workstreams and automation used to manage the incident. Despite some challenges with communication, the team was able to recover from the outage within 48 hours. Lessons learned included the importance of autonomy, ownership, and blamelessness, as well as the need for improved training and practice drills.