Home / Companies / Datadog / Blog / Post Details
Content Deep Dive

2023-03-08 Incident: Infrastructure connectivity issue affecting multiple regions

Blog post from Datadog

Post Details
Company
Date Published
Author
Alexis Lê-Quôc
Word Count
1,976
Company Posts That Month
39
Language
English
Hacker News Points
8
Summary

The Datadog team experienced a global outage starting March 8, 2023, at 06:03 UTC, affecting US1, EU1, US3, US4, and US5 regions across all services. The incident was caused by an automatic security update to systemd on Ubuntu 22.04, which deleted routes managed by the Container Network Interface (CNI) plugin, leading to network stack issues. The outage resulted in data ingestion problems, unavailability of monitors, and limited web access. Recovery efforts involved restoring compute capacity, recovering services in parallel, and addressing cloud provider-specific auto-scaling logic differences. The incident highlighted the importance of considering indirect couplings between regions, prioritizing live data processing, and improving communication with customers during outages. The team learned valuable lessons to strengthen their foundational infrastructure and improve resilience.

Trends Found in this Post
Trend Post Mentions Total Month Mentions Posts Companies MoM
Real-time 4 1,875 540 158 +10%
Kubernetes 3 1,589 172 71 +19%
Data Pipeline 2 538 152 55 +19%