After the Retrospective: Heroku Incident #1892

Post Details

Company

Gremlin

Date Published

Oct. 8, 2019

Author

Matthew Helmke

Word Count

2,244

Company Posts That Month

6

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.gremlin.com/blog/heroku-incident-1892

Summary

Heroku's retrospective on its 2019 outage, caused primarily by the failure of an upstream dependency at AWS, highlights the importance of robust playbooks, simulations, and infrastructure resilience in handling incidents effectively. During the event, Heroku's Dynos and other services such as Redis and Postgres were significantly disrupted by a power failure in AWS's US-EAST-1 region, which also limited Heroku's ability to allocate additional infrastructure. Despite these challenges, Heroku communicated effectively with customers and took full responsibility, ultimately using the incident as a learning opportunity to improve its systems. The retrospective emphasizes the need for automation, redundancy, and chaos engineering practices to prevent future outages, and underscores the importance of testing playbooks and improving system reliability with tools like Gremlin for chaos experiments. Heroku's follow-up steps after the incident demonstrate a commitment to minimizing downtime and strengthening service resilience in the face of unexpected failures.

