Hardening Workers KV

What's this blog post about?

Cloudflare recently experienced a series of incidents affecting its Workers KV (Key-Value) service, which stores configuration and data for applications running on Cloudflare's serverless platform. The root cause was an incorrectly deployed code change that caused keys in affected locations to be persisted with invalid configurations across requests, leaving the Worker "frozen" until a rollback was performed 10 minutes later. In addition, a bug in the deployment logic of a new progressive release process for Workers KV prolonged the incident by dropping some traffic until it was rolled back. Cloudflare estimates that the affected traffic accounted for 0.2-0.5% of KV's global traffic, with impacted customers seeing error rates approaching 20%. To improve reliability and reduce the risks associated with Workers KV, Cloudflare plans several improvements: enhancing observability tooling for unhandled exceptions, improving safety around environment variable mutations in a Worker, expanding test coverage, refining release processes, adding better logging, adjusting alerting thresholds, and addressing maturity issues with progressive deployment tooling. Cloudflare acknowledges that these incidents did not meet its customers' expectations for the KV service and says it is working to address both the specific issues that led to this cycle of incidents and broader reliability concerns across Cloudflare services that rely on Workers KV.

Company
Cloudflare

Date published
Aug. 2, 2023

Author(s)
Matt Silverlock, Charles Burnett, Rob Sutter, Kris Evans

Word count
2576

Hacker News points
8

Language
English


By Matt Makai. 2021-2024.