
Incident Review: Meta-Review, August 2020

Blog post from Honeycomb

Post Details
Company: Honeycomb
Date Published:
Author: Emily Nakashima
Word Count: 2,204
Language: English
Hacker News Points: -
Summary

Between July 28 and August 6, 2020, Honeycomb experienced a series of five incidents affecting its production and dogfooding environments, three of which caused partial customer-facing outages. The incidents were triggered by a mix of provider issues, such as a DNS outage in Amazon Route 53, and internal changes, including problematic code deployments that affected the query engine and storage engine database migrations. The on-call team recovered quickly, and Service Level Objectives (SLOs) helped detect problems early, but the clustering of incidents prompted Honeycomb to conduct a comprehensive meta-incident review to identify common factors. The review found that the incidents often involved configuration code at the seams between system layers, areas where automated testing is difficult, and gaps in institutional knowledge caused by team changes. In response, Honeycomb plans to enhance its automated checks to better simulate customer traffic, investigate deployment process improvements that allow faster rollbacks, and invest in additional documentation and training. The company aims to complete these improvements by the end of Q3 to provide higher reliability and availability for its customers.