2023-03-08 incident: A deep dive into the platform-level impact

Post Details

Company

Datadog

Date Published

May 24, 2023

Author

Laurent Bernaille

Word Count

3,555

Language

English

Hacker News Points

1

Source URL

www.datadoghq.com/blog/engineering/2023-03-08-deep-dive-into-platform-level-impact

Summary

The outage on March 8, 2023, was caused by a new behavior introduced in Ubuntu 22.04's systemd-networkd that flushed all IP rules it did not know about when starting up. This change affected nodes running Ubuntu 22.04, which had adopted the new version of systemd, and led to instances losing network connectivity on multiple regions across distinct cloud providers. The impact was particularly severe on AWS due to its auto-scaling features, which terminated and replaced thousands of instances in a matter of minutes, resulting in lost data stored on local disks. In contrast, Google Cloud and Azure instances could be recovered by simply restarting them, while Kubernetes hosts running on these providers were able to recover their connectivity more quickly. The incident highlighted the importance of understanding the network configuration and potential interactions with cloud provider-specific features when deploying software.