Nine Takeaways From Going 'Andon Red'

Blog post from New Relic

Post Details
Company: New Relic
Date Published:
Author: Matthew Jarvie
Word Count: 1,625
Language: English
Hacker News Points: -
Summary

New Relic's incident response process includes the Andon system, originally developed by Toyota, for managing non-emergency issues that might affect customer experience, such as the "span count bug" encountered by its distributed tracing team. Teams signal their status (green for OK, yellow for warning, red for emergency) in a dedicated Slack channel, which lets them draw additional support and resources from other teams. When the distributed tracing team discovered a discrepancy in trace data, it went "Andon red" to prioritize resolving the issue. The effort involved extensive troubleshooting, engaging external support, and a blameless retrospective that captured nine key takeaways. These include prioritizing central issues, increasing communication, accepting manageable risk, and documenting evidence. Together they highlight the value of a structured Andon process for resolving system problems efficiently and preventing future customer dissatisfaction.