Always. Enable. Keepalives.

Blog post from Honeycomb

Post Details
Company: Honeycomb
Author: Lex Neva
Word Count: 1,720
Language: English
Summary

During a failure testing project, Honeycomb discovered an issue with the OpenTelemetry SDK for Go that caused an application to stop sending telemetry for over 15 minutes, even though the underlying infrastructure kept functioning normally. The problem was observed in Beagle, the service that manages stream processing for SLOs within the Honeycomb observability platform. During a simulated availability zone (AZ) failure, one of the Beagle instances stopped sending telemetry because of a TCP connection timeout, which led to false alerts. A deep dive into the network logs revealed that the gRPC connections were not using keepalives, which are needed to detect failed connections so that they can be reopened. The fix was to configure the gRPC library to use keepalives, so that the system detects and handles connection failures promptly. The fix was verified by simulating another AZ failure, during which telemetry continued to flow uninterrupted. The lesson, reinforced through chaos engineering practice, is to enable keepalives at the application level so that applications stay robust against network disruptions.
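To make the mechanism concrete, here is a minimal sketch of an application-level keepalive in Go, using only the standard library. It is not gRPC's implementation and not the code from the post: the one-byte ping protocol and the names `ping`, `echo`, `demo`, and `ErrPeerUnresponsive` are hypothetical, chosen for illustration. The idea it shows is the one the summary describes: send a small probe, require a reply within a short deadline, and treat silence as a failed connection instead of waiting for the operating system's much longer TCP timeout.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// ErrPeerUnresponsive reports that a ping was not answered before the timeout.
// (Hypothetical name, for illustration only.)
var ErrPeerUnresponsive = errors.New("keepalive: peer did not answer ping in time")

// ping writes one byte and waits for a one-byte reply, giving up after timeout.
func ping(conn net.Conn, timeout time.Duration) error {
	// One deadline bounds the write and the read together.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	defer conn.SetDeadline(time.Time{}) // clear the deadline for later use

	if _, err := conn.Write([]byte{'p'}); err != nil {
		return fmt.Errorf("%w: %v", ErrPeerUnresponsive, err)
	}
	reply := make([]byte, 1)
	if _, err := conn.Read(reply); err != nil {
		return fmt.Errorf("%w: %v", ErrPeerUnresponsive, err)
	}
	return nil
}

// echo answers every byte it receives, standing in for a healthy peer.
func echo(conn net.Conn) {
	buf := make([]byte, 1)
	for {
		if _, err := conn.Read(buf); err != nil {
			return
		}
		if _, err := conn.Write(buf); err != nil {
			return
		}
	}
}

// demo pings one healthy peer and one silent peer over in-memory pipes
// and returns the result of each ping.
func demo() (healthyErr, silentErr error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go echo(server)
	healthyErr = ping(client, time.Second)

	// The silent peer never reads or writes, like a host that vanished
	// without closing its connection.
	client2, silent := net.Pipe()
	defer client2.Close()
	defer silent.Close()
	silentErr = ping(client2, 100*time.Millisecond)
	return healthyErr, silentErr
}

func main() {
	healthyErr, silentErr := demo()
	fmt.Println("healthy peer:", healthyErr)
	fmt.Println("silent peer:", silentErr)
}
```

In a real service the caller would run `ping` on a timer and, on error, close the connection and dial a new one. With grpc-go this loop does not need to be hand-written: the `google.golang.org/grpc/keepalive` package exposes `ClientParameters` (fields `Time`, `Timeout`, and `PermitWithoutStream`), passed to the client with `grpc.WithKeepaliveParams`. The summary does not say which values the author chose, so none are suggested here.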