
When Canary Alerts Go Wrong: How We Fixed It and Doubled Down on OSS

Blog post from Flagsmith

Post Details
Company: Flagsmith
Date Published:
Author: Kim Gustyr
Word Count: 1,703
Company Posts That Month: 8
Language: English
Hacker News Points: -
Summary

Flagsmith's Edge API suffered its first downtime incident on January 22, 2026, ending a perfect uptime record that dated back to its launch in 2022. The Edge API is a low-latency service that runs on AWS Lambda functions across eight regions, and the incident followed a deployment that included a significant refactor of its evaluation engine. It exposed a structural flaw in the canary deployment strategy: the canary alarms were not version-specific, so lingering errors from previous versions caused deployment lockouts. The immediate fix was a "skip-canary" option for urgent releases; the long-term fix was to create version-scoped canary alarms. That work led Flagsmith to fork an outdated Serverless plugin and switch to osls, a community-maintained framework, so that its infrastructure remained open source. The experience underscored the importance of robust alarm logic and the value of open-source solutions, and it drove the company to contribute back to the community by enhancing and publishing its plugins and supporting the osls project.
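
The alarm-scoping flaw described in the summary can be illustrated with a small Python sketch. This is not Flagsmith's implementation and does not use any AWS or Serverless API; the names `ErrorSample`, `alarm_fires`, and `canary_allowed`, the version labels, and the threshold are all hypothetical, chosen only to show why an alarm that counts errors across every version can block the very deployment that fixes them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ErrorSample:
    """Hypothetical error metric: how many errors one function version emitted."""
    function_version: str
    count: int


def alarm_fires(samples: Iterable[ErrorSample], threshold: int,
                version: Optional[str] = None) -> bool:
    """Return True when the summed error count exceeds the threshold.

    version=None models an alarm that aggregates over all versions;
    passing a version models an alarm scoped to that version only.
    """
    total = sum(s.count for s in samples
                if version is None or s.function_version == version)
    return total > threshold


def canary_allowed(samples: Iterable[ErrorSample], threshold: int,
                   canary_version: str, scoped: bool,
                   skip_canary: bool = False) -> bool:
    """Decide whether a canary rollout may proceed (illustrative logic only)."""
    if skip_canary:
        # Assumed behaviour of an emergency bypass: ignore the alarm entirely.
        return True
    watched = canary_version if scoped else None
    return not alarm_fires(list(samples), threshold, watched)


# Version "41" is the faulty release still emitting errors; "42" is the fix.
samples = [ErrorSample("41", 120), ErrorSample("42", 0)]

unscoped_result = canary_allowed(samples, 10, "42", scoped=False)
scoped_result = canary_allowed(samples, 10, "42", scoped=True)
bypass_result = canary_allowed(samples, 10, "42", scoped=False, skip_canary=True)
```

Under these assumptions the unscoped check rejects the fix because it still sees errors from version "41", the scoped check lets it through because version "42" is clean, and the bypass lets it through without consulting the alarm at all, which is why a bypass suits urgent fixes but scoping is the durable remedy.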

Trends Found in this Post
Trend          Post Mentions  Total Month Mentions  Posts  Companies  MoM
Serverless     16             1,797                 597    92         +165%
Observability  1              3,421                 707    180        -24%