Home / Companies / Honeycomb / Blog / Post Details
Content Deep Dive

Incident Review: Shepherd Cache Delays

Blog post from Honeycomb

Post Details
Company
Date Published
Author
Fred Hebert
Word Count
1,514
Language
English
Hacker News Points
-
Summary

An outage on September 8th, 2022, caused significant interruptions in the ingest system for over eight hours, revealing critical vulnerabilities in the system's architecture, particularly with the in-memory caching mechanism of Shepherd hosts. Despite initial efforts to stabilize the system through vertical scaling and aggressive sampling, the issue persisted, manifesting as a “shark fin” pattern in system performance graphs, where accumulated requests completed simultaneously but unpredictably. The Engineers discovered that the cache's locking mechanism led to bottlenecks when entries were backfilled, causing cascading failures in the Shepherd and Refinery clusters. A decisive fix involved pre-filling the cache, which successfully improved system stability, although the root cause of the cascading failures remained elusive. The incident highlighted the challenges of managing complex systems under stress, underscoring the importance of accurate mental models, the impact of technical issues on engineering teams, and the necessity for long-term architectural improvements to support growth.