Company
Date Published
Author
The Cypress
Word count
524
Language
English
Hacker News points
None

Summary

On September 24th, 2018, the Cypress Dashboard service was down for four hours. The outage resulted from a combination of internal platform errors at Cypress's hosting provider and reduced visibility, because the monitoring service Cypress relied on, New Relic, was also down. The primary issue was congestion in Cypress's Redis-based queue system: a backlog of over 100,000 analytics event tracking jobs overwhelmed the queue and blocked the API servers. To resolve the problem, the team increased the Redis memory limit, added more worker servers, and raised the queue system's concurrency level, which quickly cleared the backlog and restored functionality. Following the incident, Cypress committed to improving its alerting and monitoring, launching a dedicated status site, and distributing its infrastructure across more geographic regions to reduce the likelihood of future downtime. Cypress acknowledged its users' patience and encouraged them to reach out with any questions.
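The effect of adding workers and raising concurrency can be illustrated with a rough sketch. This is not Cypress's code: it uses an in-process `queue.Queue` as a stand-in for the Redis-backed queue, and the function names (`drain_ticks`, `drain`), the worker counts, and the concurrency values are all hypothetical. Only the backlog size of roughly 100,000 jobs comes from the summary above.

```python
import math
import queue
from concurrent.futures import ThreadPoolExecutor


def drain_ticks(backlog: int, workers: int, concurrency: int) -> int:
    """Crude model: ticks needed to clear `backlog` jobs when each of
    `workers` servers handles `concurrency` jobs per tick."""
    return math.ceil(backlog / (workers * concurrency))


def drain(jobs: queue.Queue, concurrency: int) -> int:
    """Drain `jobs` with `concurrency` threads; return how many were handled."""

    def worker() -> int:
        handled = 0
        while True:
            try:
                jobs.get_nowait()
            except queue.Empty:
                return handled
            handled += 1  # a real worker would process the analytics event here

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        return sum(f.result() for f in futures)


# Hypothetical backlog matching the order of magnitude in the summary.
backlog = queue.Queue()
for event_id in range(100_000):
    backlog.put(event_id)

handled = drain(backlog, concurrency=8)  # concurrency value is an assumption
```

In this model the time to clear the backlog shrinks in proportion to `workers * concurrency`, which is consistent with the summary's account that more worker servers and a higher concurrency level cleared the queue quickly. It ignores real-world limits such as Redis memory, network latency, and job duration.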