/plushcap/analysis/datadog/2023-03-08-multiregion-infrastructure-connectivity-issue

2023-03-08 Incident: Infrastructure connectivity issue affecting multiple regions

What's this blog post about?

On March 8, 2023, Datadog experienced a major outage affecting US1, EU1, US3, US4, and US5 regions across all services. The incident began at 06:03 UTC, causing users to lose access to the platform and various services via browser or APIs. Data ingestion was also impacted initially. Investigation started immediately, with the first status page update indicating an issue at 06:31 UTC. Initial signs of recovery were seen at 09:13 UTC (web access restored), and major service operational by 16:44 UTC. All services in all regions were declared operational on March 9, 2023, at 08:58 UTC, with the incident fully resolved on March 10, 2023, at 06:25 UTC once historical data was backfilled. The root cause of the outage was identified as a security update to systemd that caused a latent adverse interaction in the network stack on Ubuntu 22.04 via systemd v249. This issue affected tens of thousands of nodes between 06:00 and 07:00 UTC, causing enough of Datadog's fleet to be offline that the outage was visible to customers by 06:31 UTC. The legacy security update channel responsible for this has been disabled across all affected regions, preventing it from happening again. Datadog's response involved an emergency operation center with over 750 incident responders working in four shifts to bring the incident to full resolution. Frequent updates were provided on the status page to keep customers informed of progress. The high-level steps of recovery included restoring compute capacity, recovering each service in parallel, and addressing issues related to cloud providers' auto-scaling logic. Lessons learned from this outage include improving communication during incidents, prioritizing access to live data over historical data, and enhancing chaos testing to consider larger scale disruptions. Datadog is committed to verifiably improving the resilience of its services based on these learnings.

Company
Datadog

Date published
May 16, 2023

Author(s)
Alexis Lê-Quôc

Word count
1957

Hacker News points
8

Language
English


By Matt Makai. 2021-2024.