Company
Date Published
Author
Alexis Lê-Quôc
Word count
1976
Language
English
Hacker News points
8

Summary

The Datadog team experienced a global outage starting March 8, 2023, at 06:03 UTC, affecting US1, EU1, US3, US4, and US5 regions across all services. The incident was caused by an automatic security update to systemd on Ubuntu 22.04, which deleted routes managed by the Container Network Interface (CNI) plugin, leading to network stack issues. The outage resulted in data ingestion problems, unavailability of monitors, and limited web access. Recovery efforts involved restoring compute capacity, recovering services in parallel, and addressing cloud provider-specific auto-scaling logic differences. The incident highlighted the importance of considering indirect couplings between regions, prioritizing live data processing, and improving communication with customers during outages. The team learned valuable lessons to strengthen their foundational infrastructure and improve resilience.