Incident Review: Meta-Review, August 2020

Post Details

Company

Honeycomb

Date Published

Aug. 18, 2020

Author

Emily Nakashima

Word Count

2,204

Language

English

Source URL

www.honeycomb.io/blog/incident-review-meta-review-august-2020

Summary

Honeycomb experienced five incidents between July 28th and August 6th, 2020, affecting its production and dogfooding environments; three of them caused partial customer-facing outages. The incidents were triggered by a mix of provider issues, such as a DNS outage in Amazon Route 53, and internal changes, including problematic code deployments that affected the query engine and storage engine database migrations. The on-call team recovered quickly each time, and Service Level Objectives (SLOs) helped detect problems early, but the clustering of incidents prompted Honeycomb to conduct a meta-incident review to look for common factors. The review found that the incidents often involved configuration code at the seams between system layers, areas where automated testing is difficult, and gaps in institutional knowledge caused by team changes. In response, Honeycomb plans to enhance its automated checks so they better simulate customer traffic, investigate deployment process changes that allow faster rollbacks, and invest in additional documentation and training. The company aims to complete these improvements by the end of Q3 to improve reliability and availability for its customers.