How we manage incidents at Datadog

What's this blog post about?

This article describes how Datadog manages incidents at scale, covering the end-to-end process and the tools the company has built to support it. It emphasizes two core components of incident management: a culture of resilience and blameless organizational accountability, and monitoring of Datadog's own systems. The company uses its own Incident Management tool to declare incidents, assign severity levels, set up communication channels, and designate first-line responders. During incident response, it also relies on support roles such as workstream leads, communications leads, and executive leads. To gauge the success of its incident management process, Datadog tracks several metrics: a low rate of recurrence, increasing incident complexity, decreasing time to detection, and a low rate of spurious alerts.

Company
Datadog

Date published
Nov. 6, 2023

Author(s)
Laura de Vesine, Aaron Kaplan

Word count
2517

Hacker News points
3

Language
English


By Matt Makai. 2021-2024.