2023-03-08 incident: A deep dive into our incident response

Post Details

Company

Datadog

Date Published

June 1, 2023

Author

Laura de Vesine

Word Count

3,818

Company Posts That Month

33

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.datadoghq.com/blog/engineering/2023-03-08-deep-dive-into-incident-response

Summary

Datadog's response to the global outage involved multiple teams and a large number of engineers working together. The company used a "you build it, you own it" model, with all systems instrumented to provide telemetry data to teams. A rotation of senior engineers was on call for high-severity incidents, and a system was in place for rapid escalation to executives and customer support. The response was designed to scale, using workstreams and automation to manage the incident. Despite some communication challenges, the teams recovered from the outage within 48 hours. Lessons learned included the importance of autonomy, ownership, and blamelessness, as well as the need for improved training and practice drills.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	5	1,520	189	63	-4%