Failure is inevitable: Learning from a large outage, and building for reliability in depth at Datadog

Post Details

Company

Datadog

Date Published

Oct. 15, 2025

Author

Laura de Vesine, Rob Thomas, Maciej Kowalewski

Word Count

2,643

Company Posts That Month

28

Language

English

Hacker News Points

-

Source URL

www.datadoghq.com/blog/engineering/rethinking-reliability

Summary

In March 2023, Datadog suffered a significant outage triggered by an unattended global update, which exposed critical limits in its systems' ability to handle failure gracefully. The incident prompted a shift from trying to prevent failures entirely to designing for graceful degradation, so that products remain partially functional even during major disruptions. In response, Datadog reevaluated its system designs, focusing on data persistence, prioritizing real-time data processing, and using chaos testing to validate the improvements. The team also recognized the importance of avoiding global control systems and of reducing technical debt to prevent complex failure modes. By prioritizing the end-user experience and building systems that adapt and recover quickly, Datadog has reduced both the impact and the duration of incidents, with a noticeable decrease in significant incidents and faster recovery times across its products.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	4	1,423	250	85	+59%
Real-time	3	6,551	1,245	236	+61%
Secrets Management	2	1,168	199	91	+15%