Company
Date Published
Author
Alex MacCaw
Word count
828
Language
English
Hacker News points
None

Summary

A recent outage took most servers down, leaving them serving 503 errors. It began with a fix applied to CoreOS machines to address high CPU usage related to systemd-journald. The fix inadvertently restarted Docker on each machine, which shut down all running containers and set off cascading failures: the service discovery architecture failed, and the in-house Docker registry and Consul servers ran into problems. The failure was compounded because Docker could not cleanly restart containers or pull the images it needed while the registry was disrupted. To prevent future incidents, the team plans to avoid applying fixes on live machines, switch from the in-house Docker registry to a third-party one, ensure containers are cleaned up properly, run Consul servers with a persistent data store, and simplify the architecture by running infrastructure services directly under systemd instead of Docker.