Company
Date Published
Author
Ray Chen
Word count
890
Language
English
Hacker News points
None

Summary

We recently experienced a networking outage on Railway that affected public and private networking across multiple regions, with Asia impacted the most. The trigger was a maintenance issue on Google Cloud SQL's Postgres service that terminated all connections to our network control plane; routing information was lost as a result, and some instances subsequently crashed. The root cause was the design of the control plane itself: we ran a single global control plane, where regional control planes would have contained the blast radius of the failure. To fix this, Railway is prioritizing an upgrade to regional control planes, along with persistent caching, faster startup times, and an improved Google Cloud SQL configuration to prevent similar incidents. The outage was resolved after 10 minutes, during which users visiting Railway-hosted domains or communicating over our private network experienced instability.
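
To make the persistent caching idea concrete, here is a minimal sketch of a node keeping its last known-good routing table on disk so that losing the control plane connection does not mean losing routes. This is not Railway's implementation: the `RouteCache` class, the `fetch` callback standing in for a control plane query, and the hostname-to-address route shape are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Dict

# Hypothetical route shape: public hostname -> backend address.
Routes = Dict[str, str]


class RouteCache:
    """Keeps the last known-good routing table in memory and on disk."""

    def __init__(self, path: str, fetch: Callable[[], Routes]) -> None:
        self.path = path
        self.fetch = fetch  # hypothetical call to the control plane
        # On startup, serve from disk before the control plane is reachable.
        self.routes: Routes = self._load()

    def _load(self) -> Routes:
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return {}

    def _save(self, routes: Routes) -> None:
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated cache file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(routes, f)
        os.replace(tmp, self.path)

    def refresh(self) -> bool:
        """Pull fresh routes; on failure keep serving the cached ones."""
        try:
            routes = self.fetch()
        except ConnectionError:
            return False
        self.routes = routes
        self._save(routes)
        return True
```

With a design along these lines, a dropped control plane connection degrades to serving possibly stale routes rather than none, and a restarted instance can begin routing from the on-disk copy before its first successful refresh.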