How One Company Accidentally Autoscaled to 200 Nodes and Crashed The App

Post Details

Company

Komodor

Date Published

Feb. 8, 2022

Author

Guy Menachem

Word Count

1,073

Language

English

Hacker News Points

-

Source URL

komodor.com/blog/company-autoscales-to-200-nodes-and-crashe-the-app

Summary

An e-commerce company, referred to as "KubeCorp Inc," suffered a severe incident when its system unexpectedly autoscaled to 200 nodes, causing significant downtime and leaving the application unresponsive. The incident began late at night and was detected by the NOC team, who noticed unusual CPU and memory consumption accompanied by a steep increase in the number of pods, from the typical 30 to 4,000. The root cause was traced back to a recent code change involving data structure modifications and extensive database schema alterations. The change raised CPU load, which inadvertently triggered the Horizontal Pod Autoscaler (HPA) to spawn additional pods. The on-call engineer, Rick, struggled to isolate and resolve the issue until the Chief DevOps intervened; the situation was ultimately resolved after 4.5 hours by reverting the database schema and the code. The blog suggests that Komodor, a unified platform for incident management, could have simplified troubleshooting and potentially cut the resolution time to less than 30 minutes by providing real-time alerts, easy access to Git change history, and step-by-step remediation instructions.
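
To illustrate how a CPU-driven HPA can run away in the manner described, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The CPU figures (95% observed, 50% target), the 4,000-pod ceiling used as `max_replicas`, and the function names are assumptions for illustration only; the post does not state KubeCorp's actual HPA settings.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     max_replicas, min_replicas=1, tolerance=0.1):
    """Replica count suggested by the documented HPA formula.

    current_cpu and target_cpu are average utilization percentages.
    All concrete values passed in below are hypothetical.
    """
    ratio = current_cpu / target_cpu
    # The HPA leaves the replica count alone when the ratio is close to 1.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * current_cpu / target_cpu)
    # Clamp to the configured bounds.
    return max(min_replicas, min(desired, max_replicas))


def simulate_runaway(start_replicas, stuck_cpu, target_cpu, max_replicas):
    """Assume CPU stays high no matter how many pods are added,
    e.g. because the extra load comes from the code change itself."""
    history = [start_replicas]
    while True:
        nxt = desired_replicas(history[-1], stuck_cpu, target_cpu, max_replicas)
        if nxt == history[-1]:
            return history
        history.append(nxt)


if __name__ == "__main__":
    # Hypothetical numbers: 30 pods, CPU pinned at 95%, target 50%.
    steps = simulate_runaway(30, 95, 50, max_replicas=4000)
    print(steps)
```

Under these assumptions the replica count nearly doubles on every evaluation until it reaches the ceiling, because adding pods does not lower the per-pod CPU. A conservative `maxReplicas` on the HPA is one common guard against this kind of runaway.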