Tracing Refinery - Plushcap

Post Details

Company

Honeycomb

Date Published

Jan. 22, 2025

Author

Tyler Helmuth

Word Count

878

Company Posts That Month

11

Language

English

Hacker News Points

-

Source URL

www.honeycomb.io/blog/tracing-refinery

Summary

Refinery 2.9 introduced significant performance improvements, chiefly a fix for a critical issue in the collect loop that had been causing Kubernetes pods to fail liveness checks because a single pass took too long. Customer feedback and internal metrics showed that the collect loop sometimes took over 20 seconds to complete a cycle. The team used distributed tracing with OpenTelemetry to locate the bottleneck in the function sendExpiredTracesInCache, which was slow because of the sampling decisions it makes. Further investigation pointed to the EMADynamicSampler, whose locking during sampling decisions and weight recalculations added latency. The fix lets users set a maximum number of expired traces processed per loop pass, with a new default value chosen to cap the loop at three seconds. Together with tail sampling, this let the team improve performance without unnecessary resource expenditure, and it shows how distributed tracing can diagnose and resolve complex software issues.
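
The idea of capping the work done per loop pass can be illustrated with a minimal Python sketch. This is not Refinery's code (Refinery is written in Go), and every name here (`Trace`, `send_expired_traces`, `max_expired_per_pass`, `decide`) is hypothetical, chosen only for illustration. It assumes the cache is ordered by expiry time, oldest first.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    expires_at: float  # time after which the trace is ready for a sampling decision


def send_expired_traces(cache, now, max_expired_per_pass, decide):
    """Handle at most max_expired_per_pass expired traces and return the count.

    `cache` is assumed to be a deque ordered by expires_at, oldest first.
    `decide` stands in for the (possibly slow) sampling decision.
    """
    handled = 0
    while cache and cache[0].expires_at <= now and handled < max_expired_per_pass:
        trace = cache.popleft()
        decide(trace)
        handled += 1
    return handled


# Ten cached traces, eight of which have expired by now=7.0.
cache = deque(Trace(f"t{i}", expires_at=float(i)) for i in range(10))
decided = []

# Each pass is bounded, so the loop returns to its other duties quickly.
first_pass = send_expired_traces(cache, now=7.0, max_expired_per_pass=3, decide=decided.append)
second_pass = send_expired_traces(cache, now=7.0, max_expired_per_pass=3, decide=decided.append)
```

In this sketch, expired traces left over from one pass stay at the front of the cache and are picked up on the next pass, so the cap trades a bounded pass duration for a slightly later decision on some traces.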

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Observability	11	998	293	96	-42%
Kubernetes	1	1,208	158	73	-30%
OpenTelemetry	1	559	44	22	+15%