2023-03-08 incident: A deep dive into the platform-level recovery

Post Details

Company

Datadog

Date Published

June 16, 2023

Author

Laurent Bernaille

Word Count

4,523

Company Posts That Month

33

Language

English

Hacker News Points

-

Source URL

www.datadoghq.com/blog/engineering/2023-03-08-deep-dive-into-platform-level-recovery

Summary

The Datadog platform experienced a global outage that affected all services across multiple regions, resulting in a loss of 60 percent of compute capacity. The teams quickly realized they needed to restore the platform and worked through decisions whose outcomes and downstream effects were not immediately obvious. They recovered the EU1 region by rebooting impacted nodes, progressively recovering clusters, and ultimately restoring 100 percent compute capacity. However, they faced challenges in other regions, including US1, where instances remained in an "Unregistered" state due to Vault instability, and scaling up to process a backlog of data caused issues with node registration and IP allocation. The teams learned several lessons, including the importance of strong relationships with partners, testing for global failure scenarios, and preparing for sudden capacity increases. They successfully restored their platform-level capabilities and are now working on restoring the Datadog application.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	21	1,520	189	63	-4%
Secrets Management	19	1,233	99	56	+30%
Observability	2	1,266	225	69	-10%