The incident began when an AWS Elastic Block Store (EBS) volume backing Datadog's Postgres database degraded, causing a noticeable slowdown in database performance. The faulty volume forced a manual failover, a process that proved slow and error-prone because it depended heavily on Chef runs to reconfigure hosts. The outage was compounded by EBS serving other critical roles as well, including storage for both the Postgres database and the configuration management server running Chef. Several factors limited the damage: Datadog's multi-zone deployment, its otherwise limited use of EBS, and the fact that data intake continued throughout the outage. The lessons learned highlight the risks of shared storage, the importance of keeping spare capacity available for recovery, and the need to wean off addictive but failure-prone technologies like EBS in favor of more robust alternatives.