Company
Date Published
Author
Guy Menachem
Word count
1073
Language
English
Hacker News points
None

Summary

An e-commerce company, referred to as "KubeCorp Inc," suffered a severe incident when its system unexpectedly autoscaled to 200 nodes, causing significant downtime and unresponsive applications. The incident began late at night and was detected by the NOC team, who noticed unusual CPU and memory consumption as the number of pods climbed from the typical 30 to 4,000. The root cause was traced to a recent code change involving data structure modifications and extensive database schema alterations. The change raised CPU load, which in turn caused the Horizontal Pod Autoscaler (HPA) to spawn additional pods. The on-call engineer, Rick, struggled to isolate and resolve the issue until the Chief DevOps intervened, and the incident was resolved after 4.5 hours by reverting the database schema and code. The blog post suggests that Komodor, a unified platform for incident management, could have simplified troubleshooting and potentially cut the resolution time to less than 30 minutes by providing real-time alerts, easy access to Git change history, and step-by-step remediation instructions.
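
To illustrate how an HPA can run away when CPU load does not fall as pods are added, here is a minimal sketch based on the scaling rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The function names, the 80% CPU target, the 160% observed utilization, and the 4,000-replica ceiling are assumptions chosen for illustration and are not taken from the post. The sketch also ignores stabilization windows and scale-up rate policies, which a real cluster would apply.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified form of the documented HPA rule (hypothetical helper)."""
    ratio = current_cpu / target_cpu
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(desired, max_replicas))


def simulate_runaway(start_replicas, stuck_cpu, target_cpu, max_replicas):
    """Replica count per evaluation cycle, assuming per-pod CPU never drops.

    This models a defect that keeps every pod busy regardless of how many
    pods exist, so adding pods does not relieve the load (an assumption).
    """
    history = [start_replicas]
    while True:
        nxt = desired_replicas(history[-1], stuck_cpu, target_cpu,
                               max_replicas=max_replicas)
        if nxt == history[-1]:
            return history
        history.append(nxt)


if __name__ == "__main__":
    # Assumed values: 160% observed CPU against an 80% target, cap of 4,000.
    print(simulate_runaway(30, 160, 80, 4000))
```

Under these assumed values the replica count doubles on every evaluation cycle until it hits the configured maximum, which is why a conservative `maxReplicas` is a common guardrail against this kind of feedback loop.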