Company
Date Published
Author
Angelo Saraceno
Word count
977
Language
English
Hacker News points
None

Summary

The Railway platform experienced a 30% outage on its US-West compute infrastructure, causing workloads to become unreachable for up to an hour. The root cause was attributed to a newly deployed metrics collection agent that triggered a CPU core soft lock due to excessive memory transfers between kernel and userspace. This issue was only detected after the deployment had begun and affected a subset of machines randomly throughout the fleet, with no discernible pattern or warning signs. The team manually restarted affected hosts, and by 20:23 UTC, all hosts were back online, with total confirmation from customers that workloads were restored. The incident highlighted the need for improvement in internal deployment tooling, building resilience, removing "black box" systems, better system checks, and mandating staged rollouts for all services.