Company
Date Published
Author
Karthik Ranganathan
Word count
1061
Language
English
Hacker News points
None

Summary

Plume, a company that provides smart home services to ISPs, uses `YugabyteDB` as its database management system. On November 12, 2019, an entire availability zone in the eu-central-1 region of AWS went down, causing an outage for Plume's operations. The outage occurred at 12:07am PST, and although the AWS status page was not updated until 12:16am PST, `YugabyteDB`'s monitoring system had already alerted Plume's operations team. The database clusters recovered automatically once AWS fixed the AZ failure, though some instances took longer because of issues with EBS volumes. After manual intervention to mount the disk on one of the affected nodes and to decommission another node that was not reachable via SSH, all alerts were resolved, and Plume's services were confirmed to be running smoothly without any data loss or application impact. The outage highlighted how `YugabyteDB` can handle zone outages with automatic failover and replication across multiple availability zones.