Company
Date Published
Author
Tom Preston-Werner
Word count
428
Language
English
Hacker News points
None

Summary

One morning, a surge of SSH connections overwhelmed the site's backend because of an RPC call that checked for repository existence, causing cascading delays and increased load on the frontends. Removing this RPC call relieved the immediate problem, but an unrelated one persisted: recent package upgrades to the RPC stack caused the servers to serve requests only sporadically after a restart. After rolling back to a previous stable state and restarting the daemons, normal site operations resumed within a couple of hours. In response, plans were made to strengthen testing, improve logging in the SSH script, and deploy packages more granularly to reduce the risk of similar incidents. The disruption also exposed several subtle bugs and deepened understanding of the new architecture, which should make the system more robust.