Dashboard Service Downtime Post-mortem on September 24th, 2018

Post Details

Company

Cypress

Date Published

Sept. 27, 2018

Author

The Cypress

Word Count

524

Company Posts That Month

3

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.cypress.io/blog/downtime-post-mortem-09-24-2018

Summary

On September 24th, 2018, the Cypress Dashboard service was down for four hours. The outage resulted from internal platform errors at Cypress's hosting provider, compounded by reduced visibility because the monitoring service, New Relic, was also down. The primary issue was congestion in the Redis-based queue system: more than 100,000 analytics event tracking jobs overwhelmed the queue and blocked the API servers. To resolve the problem, the team increased the Redis memory limit, added more worker servers, and raised the queue system's concurrency level, which quickly cleared the backlog and restored functionality. Following the incident, Cypress committed to improving its alerting and monitoring, launching a dedicated status site, and expanding the geographical distribution of its infrastructure to reduce the likelihood of future downtime. The post thanked users for their patience and encouraged them to reach out with any questions.
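
To illustrate why adding workers and raising concurrency can clear a backlog of this kind, here is a minimal back-of-the-envelope sketch. Only the 100,000-job backlog comes from the summary; the function name, arrival rate, worker counts, concurrency levels, and per-job duration are all hypothetical values chosen for illustration, not figures from the Cypress post.

```python
from typing import Optional


def drain_seconds(backlog: int, arrivals_per_s: float, workers: int,
                  concurrency: int, job_seconds: float) -> Optional[float]:
    """Estimate seconds needed to empty a job queue.

    Returns None when jobs arrive at least as fast as they are processed,
    meaning the backlog never shrinks.
    """
    # Jobs completed per second across all workers.
    throughput = workers * concurrency / job_seconds
    net_rate = throughput - arrivals_per_s
    if net_rate <= 0:
        return None
    return backlog / net_rate


BACKLOG = 100_000  # backlog size taken from the summary

# Hypothetical "before" capacity: 2 workers x 5 concurrent jobs, 0.25 s per job.
before = drain_seconds(BACKLOG, arrivals_per_s=50, workers=2,
                       concurrency=5, job_seconds=0.25)

# Hypothetical "after" capacity: 6 workers x 20 concurrent jobs.
after = drain_seconds(BACKLOG, arrivals_per_s=50, workers=6,
                      concurrency=20, job_seconds=0.25)

print("before:", before)
print("after:", after)
```

Under these assumed numbers the first configuration completes fewer jobs per second than arrive, so the queue only grows, while the second drains the whole backlog in a few minutes. The real ratios in the incident are not stated in the summary.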

