Company:
Date Published:
Author: Ray Chen
Word count: 903
Language: English
Hacker News points: None

Summary

The Railway team recently experienced a major outage of their API backend, affecting users who relied on the dashboard, CLI, and Public GraphQL API. The outage was caused by configuration changes to the API backend's infrastructure, which led to issues with the service mesh's internal proxying and with unoptimized SQL queries. To resolve it, the team rolled back the changes, then fully reverted them, and finally removed the service mesh from the API backend altogether. They have since adopted new practices for rolling out configuration changes, including staggering deployments and preparing rollback plans in advance. Railway describes the incident as a reminder of the importance of providing a best-in-class cloud experience and has renewed its commitment to eliminating similar issues in the future.