Tracing Refinery
Blog post from Honeycomb
Refinery 2.9 introduced significant performance improvements, chiefly by fixing a critical issue in Refinery's collect loop: a single pass could take so long that Kubernetes pods failed their liveness checks. Customer feedback and internal metrics showed that the collect loop sometimes took over 20 seconds to complete a cycle. To find the bottleneck, the team used distributed tracing with OpenTelemetry, which traced the delay to the function sendExpiredTracesInCache and the sampling decisions it makes. Further investigation pinpointed the EMADynamicSampler, whose locking during sampling decisions and weight recalculations added latency.

The solution lets users set a maximum number of expired traces processed per loop pass, which bounds how long a single pass can take; a new default value caps the loop at three seconds. Together with the use of tail sampling, this approach let the team optimize performance without unnecessary resource expenditure, and it demonstrates how effective distributed tracing is for diagnosing and resolving complex software issues.
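
The idea of capping the expired traces handled per loop pass can be illustrated with a minimal Go sketch. This is not Refinery's actual code: the Trace type, the takeExpired function, and the max parameter are hypothetical names chosen for illustration, and treating a non-positive max as "no limit" is an assumption of this sketch.

```go
package main

import "time"

// Trace is a hypothetical stand-in for a cached trace awaiting a sampling decision.
type Trace struct {
	ID     string
	SendBy time.Time // moment after which the trace counts as expired
}

// takeExpired splits the cache into a batch of at most max expired traces
// and the traces that stay cached. Expired traces beyond the cap are left
// in the cache for a later pass. A max of 0 or less means "no limit"
// (an assumption of this sketch, not a documented Refinery behavior).
func takeExpired(cache []Trace, now time.Time, max int) (batch, remaining []Trace) {
	for _, t := range cache {
		expired := !t.SendBy.After(now)
		if expired && (max <= 0 || len(batch) < max) {
			batch = append(batch, t)
			continue
		}
		remaining = append(remaining, t)
	}
	return batch, remaining
}
```

With a cap of 2 and three expired traces in the cache, one call returns two traces and leaves the third, along with any unexpired traces, for the next pass. The work done per pass therefore stays bounded even when a backlog of expired traces builds up, at the cost of some traces waiting one or more extra passes for their sampling decision.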