Postmortem of yesterday's downtime

Company

Clearbit

Date Published

June 10, 2016

Author

Alex MacCaw

Word count

828

Language

English

Hacker News points

None

URL

clearbit.com/blog/postmortem-of-yesterdays-downtime

Summary

A recent outage took most servers down, leaving them serving 503 errors. It began with a fix for high CPU usage caused by systemd-journald on CoreOS machines: applying the fix inadvertently restarted Docker on each machine, which shut down every running container. The failure then cascaded. The service discovery architecture failed, and the in-house Docker registry and the Consul servers ran into problems of their own. Docker could not cleanly restart containers, and with the registry disrupted it could not pull the images it needed. To prevent a repeat, the team plans to stop applying fixes on live machines, switch from the in-house Docker registry to a third-party one, ensure containers are cleaned up properly, run Consul servers with a persistent data store, and simplify the architecture by running infrastructure services directly with systemd instead of Docker.
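
The cascade described in the summary can be illustrated with a small cold-start simulation. This is a minimal sketch, not Clearbit's tooling: the service names, the dependency sets, and the `cold_start` helper are all hypothetical, and the assumption that the in-house registry needs itself in order to come back up is made only to show how a circular dependency can block recovery once everything is down at the same time.

```python
def cold_start(deps):
    """Simulate bringing services up when everything is down at once.

    deps maps a service name to the set of services that must already be
    up before it can start. Returns (start_order, stuck), where stuck is
    the set of services that never manage to start.
    """
    up = []
    remaining = set(deps)
    progressed = True
    while progressed:
        progressed = False
        # sorted() gives a stable order and a copy that is safe to iterate
        # over while the remaining set shrinks.
        for name in sorted(remaining):
            if all(dep in up for dep in deps[name]):
                up.append(name)
                remaining.discard(name)
                progressed = True
    return up, remaining


# Assumed layout during the outage: infrastructure runs inside Docker, so
# each service needs an image from the registry, including the registry.
everything_in_docker = {
    "registry": {"registry"},
    "consul": {"registry"},
    "app": {"registry", "consul"},
}

# Assumed layout after the planned changes: the registry is third-party
# and Consul runs directly under systemd, so neither waits on Docker.
infra_outside_docker = {
    "registry": set(),
    "consul": set(),
    "app": {"registry", "consul"},
}

order_before, stuck_before = cold_start(everything_in_docker)
order_after, stuck_after = cold_start(infra_outside_docker)
```

In the first layout nothing can start, because every service waits on a registry that is itself waiting. In the second layout the infrastructure services start first and the application follows, which is the intuition behind moving the registry and Consul out of Docker.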