
Postmortem of yesterday's downtime

Blog post from Clearbit

Post Details
Company: Clearbit
Date Published:
Author: Alex MacCaw
Word Count: 828
Company Posts That Month: 3
Language: English
Hacker News Points: -
Summary

A recent outage took most servers offline and left them serving 503 errors. The trigger was a fix for high CPU usage related to systemd-journald on CoreOS machines: applying it inadvertently restarted Docker on each machine, which shut down all running containers. That broke the service discovery architecture and caused problems with the in-house Docker registry and the Consul servers. The failure was compounded because Docker could not cleanly restart the containers and, with the registry disrupted, could not pull the images it needed. To prevent future incidents, the team plans to stop applying fixes on live machines, switch from the in-house Docker registry to a third-party one, ensure containers are cleaned up properly, run Consul servers with a persistent data store, and simplify the architecture by running infrastructure services directly with systemd instead of Docker.
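The cascade described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Clearbit's actual topology: the service names, the dependency sets, and the `recoverable` helper are all hypothetical. It assumes every containerized service needs the registry to pull its image, and that the in-house registry is itself a container, which is modeled here as the registry depending on itself.

```python
from typing import Dict, Set


def recoverable(deps: Dict[str, Set[str]], always_up: Set[str]) -> Set[str]:
    """Return the services that can come back after a full restart.

    `deps` maps a service to the services that must already be running
    before it can start. `always_up` holds services that survive the
    restart (for example, an external third-party registry).
    """
    up = set(always_up)
    changed = True
    # Keep starting services until a full pass starts nothing new.
    while changed:
        changed = False
        for service, needs in deps.items():
            if service not in up and needs <= up:
                up.add(service)
                changed = True
    return up


# Assumed "before" layout: everything runs in Docker and pulls from an
# in-house registry, which is itself a container (hence the self-dependency).
before = {
    "registry": {"registry"},
    "consul": {"registry"},
    "api": {"registry", "consul"},
}

# Assumed "after" layout: the registry is third-party (always reachable)
# and Consul runs directly under systemd, so it needs nothing to start.
after = {
    "consul": set(),
    "api": {"registry", "consul"},
}

print(sorted(recoverable(before, always_up=set())))
print(sorted(recoverable(after, always_up={"registry"})))
```

In the "before" layout nothing can restart, because every service waits on a registry that is waiting on itself. In the "after" layout the external registry and the systemd-managed Consul break that cycle, so the application container can start again.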
