
PIPEFAIL: How a missing shell option slowed Cloudflare down

What's this blog post about?

On December 16, 2021, Cloudflare experienced a slowdown lasting approximately 30 minutes. It was caused by an empty Quicksilver key, which in turn traced back to a missing shell option called "pipefail". The problem began when a Kubernetes cron job failed to populate the key with valid data. This caused dosd, which protects against large attacks and reads its configuration from Quicksilver, to fail. As a result, the Front Line's in-memory cache was flushed, and requests slowed down while they waited for dosd to reply. The issue was resolved by manually re-running the Kubernetes cron job. Lessons learned from the incident include scaling out services to handle high request rates and making code and systems resilient to failure.
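To illustrate how a missing "pipefail" can hide a failure, here is a minimal Python sketch (assuming bash is installed) that runs the same shell pipeline with and without `set -o pipefail`. The pipeline `false | cat` is a hypothetical stand-in, not the actual cron job from the incident: its first command fails and produces no output, mimicking a job step that emits empty data.

```python
import subprocess


def run_pipeline(script: str, pipefail: bool) -> tuple[int, str]:
    """Run `script` under bash, optionally with `set -o pipefail`.

    Returns the pipeline's exit status and its captured stdout.
    """
    prefix = "set -o pipefail; " if pipefail else ""
    result = subprocess.run(
        ["bash", "-c", prefix + script],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


# Hypothetical pipeline: `false` fails and prints nothing,
# `cat` copies its (empty) input and exits successfully.
PIPELINE = "false | cat"

status, out = run_pipeline(PIPELINE, pipefail=False)
# prints: without pipefail: status=0, output=''
print(f"without pipefail: status={status}, output={out!r}")

status, out = run_pipeline(PIPELINE, pipefail=True)
# prints: with pipefail: status=1, output=''
print(f"with pipefail: status={status}, output={out!r}")
```

Without pipefail, bash reports only the exit status of the last command in the pipeline, so the upstream failure is swallowed and a caller sees "success" alongside empty output. With pipefail, the pipeline returns a non-zero status, which a script can check (or combine with `set -e`) before writing anything downstream.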

Company
Cloudflare

Date published
April 5, 2022

Author(s)
Alex Forster

Word count
1983

Hacker News points
30

Language
English


By Matt Makai. 2021-2024.