2023-03-08 Incident: A Deep Dive into the Platform-level Recovery

What's this blog post about?

On March 8, 2023, Datadog experienced a global outage, triggered by an unexpected system patch, that affected all services across multiple regions. The company's teams worked for approximately 13 hours to restore most of the compute capacity in every region. The recovery faced several challenges: each region required a different response, a limit on the maximum number of VM instances in a peering group was reached, and subnets hit their capacity limits. Despite these obstacles, Datadog restored its platform-level capabilities and then continued working toward complete recovery.

Company
Datadog

Date published
June 16, 2023

Author(s)
Laurent Bernaille

Word count
4508

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.