Company
Date Published
Author
Matthew Jarvie
Word count
1568
Language
English
Hacker News points
None

Summary

The distributed tracing team at New Relic used the Andon system to resolve a bug that made traces inaccurate, which they called "the span count bug." The team first activated an Andon status to signal for help, then worked together to identify the root cause. They traced it to a service called "The Trace Indexer," which used a RocksDB instance with a cache layer on top: the cache was evicting records before aggregations were complete. As an immediate fix, they resized the cache so that records stayed in it until their aggregations finished, and they pursued a Kafka Streams upgrade as the permanent fix. The team drew several takeaways from the experience: declare priorities, engage outside support early and often, increase communication inside and outside the team, prioritize troubleshooting steps, accept a higher risk tolerance, be willing to consider big changes, use data, document evidence, and avoid duplicating effort. Overall, the Andon system helped the team resolve the issue efficiently while also improving how they communicate and collaborate.
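
The cache eviction problem mentioned in the summary can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Trace Indexer's code: it uses neither Kafka Streams nor RocksDB, the function name `count_spans` and the sample events are hypothetical, and it assumes that an evicted entry is emitted downstream as if its count were final. It shows how an undersized least-recently-used cache can split one trace's span count into several incomplete records, and how a larger cache avoids that.

```python
from collections import OrderedDict


def count_spans(events, cache_size):
    """Aggregate span counts per trace through a bounded LRU cache.

    Hypothetical model: when a trace is evicted before its aggregation
    is complete, its partial count is emitted as if it were final.
    """
    cache = OrderedDict()  # trace_id -> spans counted so far
    emitted = []           # (trace_id, span_count) records sent downstream

    for trace_id in events:
        cache[trace_id] = cache.get(trace_id, 0) + 1
        cache.move_to_end(trace_id)  # mark as most recently used
        if len(cache) > cache_size:
            # Cache is full: evict the least recently used trace early.
            emitted.append(cache.popitem(last=False))

    emitted.extend(cache.items())  # flush the remaining entries at the end
    return emitted


# Three traces with two spans each, arriving interleaved.
events = ["a", "b", "c", "a", "b", "c"]

small = count_spans(events, cache_size=2)  # too small: every trace is split
large = count_spans(events, cache_size=3)  # large enough: counts are complete

print("cache_size=2:", small)
print("cache_size=3:", large)
```

With a cache of two entries, each trace is evicted before its second span arrives, so six records with a count of 1 are emitted instead of three records with a count of 2. Sizing the cache to hold every in-flight trace gives the correct counts, which mirrors the resizing fix in spirit only.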