Company
Date Published
Author
Laurent Bernaille
Word count
3555
Language
English
Hacker News points
1

Summary

The outage on March 8, 2023, was caused by a behavior introduced in the version of systemd-networkd shipped with Ubuntu 22.04: on startup, it flushes all IP rules it does not know about. Nodes running Ubuntu 22.04 had adopted this version of systemd, and the affected instances lost network connectivity in multiple regions across distinct cloud providers. The impact was particularly severe on AWS, where auto-scaling features terminated and replaced thousands of instances within minutes, and the data stored on their local disks was lost. In contrast, Google Cloud and Azure instances could be recovered simply by restarting them, and Kubernetes hosts running on these providers regained connectivity more quickly. The incident highlighted the importance of understanding network configuration and its potential interactions with cloud provider-specific features when deploying software.
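
The flush behavior described in the summary can be illustrated by comparing the IP rules on a host before and after a systemd-networkd restart. The following is a minimal sketch, not the tooling used during the incident: it assumes two text snapshots in the style of `ip rule show` output, and the function names and sample rules (including the priority-20 rule standing in for one installed by a networking plugin) are hypothetical.

```python
from __future__ import annotations


def parse_rules(ip_rule_output: str) -> set[str]:
    """Return the rules found in `ip rule show`-style text, one rule per line."""
    rules = set()
    for line in ip_rule_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Normalise whitespace so tab/space differences do not matter.
        rules.add(" ".join(line.split()))
    return rules


def flushed_rules(before: str, after: str) -> list[str]:
    """Return rules present in the first snapshot but missing from the second."""
    return sorted(parse_rules(before) - parse_rules(after))


# Hypothetical snapshots taken before and after a systemd-networkd restart.
BEFORE = """\
0:\tfrom all lookup local
20:\tfrom all to 10.0.0.0/8 lookup 3000
32766:\tfrom all lookup main
32767:\tfrom all lookup default
"""
AFTER = """\
0:\tfrom all lookup local
32766:\tfrom all lookup main
32767:\tfrom all lookup default
"""

if __name__ == "__main__":
    for rule in flushed_rules(BEFORE, AFTER):
        print("missing after restart:", rule)
```

In practice the snapshots would come from running `ip rule show` on the host at two points in time; a non-empty result indicates that rules not managed by systemd-networkd were removed.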