When Canary Alerts Go Wrong: How We Fixed It and Doubled Down on OSS
Blog post from Flagsmith
Flagsmith experienced the first downtime incident on its Edge API on January 22, 2026, after maintaining a perfect uptime record since the service launched in 2022. The Edge API, a low-latency service running on AWS Lambda functions across eight regions, ran into problems after a deployment that included a significant refactor of its evaluation engine. The incident exposed a structural flaw in Flagsmith's canary deployment strategy: the alarms were not version-specific, so lingering errors from previous versions caused deployment lockouts. The immediate fix was a "skip-canary" option for urgent fixes; the long-term fix was to create version-scoped canary alarms. That work led Flagsmith to fork an outdated Serverless plugin and switch to osls, a community-maintained framework, so that its infrastructure would remain open source. The experience underscored the importance of robust alarm logic and the value of open-source solutions, and it prompted the company to contribute back to the community by enhancing and publishing its plugins and by supporting the osls project.
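To make the alarm flaw concrete, here is a minimal sketch in Python of the difference between an unscoped alarm and a version-scoped one. It is an illustration under stated assumptions, not Flagsmith's actual implementation: the names ErrorSample, unscoped_alarm, version_scoped_alarm, and should_roll_back, the version labels "41" and "42", and the error counts are all hypothetical, and the real fix was made in deployment tooling rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSample:
    """One error-count datapoint, tagged with the version that emitted it (hypothetical shape)."""
    version: str
    errors: int


def unscoped_alarm(samples, threshold=0):
    """Fire when total errors across ALL versions exceed the threshold."""
    return sum(s.errors for s in samples) > threshold


def version_scoped_alarm(samples, canary_version, threshold=0):
    """Fire only when errors emitted by the canary version exceed the threshold."""
    canary_errors = sum(s.errors for s in samples if s.version == canary_version)
    return canary_errors > threshold


def should_roll_back(samples, canary_version, scoped=True, skip_canary=False):
    """Decide whether a canary deployment should be rolled back."""
    if skip_canary:
        # Escape hatch for urgent fixes: bypass the canary check entirely.
        return False
    if scoped:
        return version_scoped_alarm(samples, canary_version)
    return unscoped_alarm(samples)


# Assumed scenario: version "41" is broken and still emitting errors,
# while version "42" is the fix being rolled out as the canary.
samples = [ErrorSample("41", 12), ErrorSample("42", 0)]

print(should_roll_back(samples, "42", scoped=False))  # unscoped: the old version's errors block the fix
print(should_roll_back(samples, "42", scoped=True))   # scoped: only the canary's own errors count
```

In this sketch the unscoped check rejects the healthy fix because it counts errors still coming from the broken version, which is the lockout described above. The scoped check judges the canary on its own errors, and skip_canary stands in for the short-term escape hatch.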