Company
Date Published
Author
Kevin Paulisse
Word count
2431
Language
English
Hacker News points
None

Summary

Over the past year, Astronomer has significantly improved its incident management process by unifying its Research and Development (R&D) and Customer Reliability Engineering (CRE) teams under a single framework. Previously, the lack of a cohesive process caused confusion and inefficiency; a unified approach, established in February 2023, streamlined communication and collaboration during incidents. The new process includes clear incident definitions, severity levels, an Incident Manager On-Call rotation, and a focus on effective postmortems. Astronomer has also adopted the "Self Responsible Teams" philosophy, under which developers support the services they build; this distributes on-call responsibilities more equitably and reduces the burden on infrastructure teams. A custom internal tool, Incident Buddy, was developed to automate incident management tasks, though Astronomer is considering third-party solutions for greater functionality in the future. The changes have bolstered internal trust and increased reliability for customers, and efforts continue to enhance automation and self-healing capabilities for further resilience.