Our Commitment to Reliability and Incident Learning
Blog post from New Relic
New Relic experienced a significant service disruption on July 29 that impacted data collection and alerting for customers in the US region. The disruption began with a technology failure in the company's Kafka systems and was exacerbated by automation and redundancy protocols. It exposed weaknesses in the cell-based architecture and in the emergency response tools, which failed to keep the problem from spreading beyond the affected cell. Human errors and misjudgments compounded the failure, including an unsafe change to data retention settings that was applied across all cells and led to disk space shortages and further broker failures. Although safety layers were in place, the incident showed the need for better isolation during emergencies and for more effective incident response processes. New Relic has committed to learning from the incident, improving its systems, and remaining transparent with customers in order to regain their trust and ensure future reliability.
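
To make the isolation point concrete, here is a minimal sketch of a cell-by-cell rollout that halts as soon as one cell looks unhealthy, so a risky change never reaches the remaining cells. It is an illustration only, not New Relic's tooling: the `Cell` type, the `disk_free_fraction` signal, the `staged_rollout` and `risky_change` functions, and the 20% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cell:
    name: str
    disk_free_fraction: float  # hypothetical health signal, 0.0 to 1.0


def staged_rollout(
    cells: List[Cell],
    apply_change: Callable[[Cell], None],
    min_disk_free: float = 0.2,  # assumed threshold, for illustration only
) -> List[str]:
    """Apply a change one cell at a time, stopping at the first unhealthy cell.

    Returns the names of the cells that were changed and stayed healthy.
    """
    completed = []
    for cell in cells:
        apply_change(cell)
        if cell.disk_free_fraction < min_disk_free:
            # Stop here so the remaining cells never receive the change.
            break
        completed.append(cell.name)
    return completed


def risky_change(cell: Cell) -> None:
    # Illustrative stand-in for a change that consumes extra disk space.
    cell.disk_free_fraction -= 0.25


cells = [Cell("cell-a", 0.5), Cell("cell-b", 0.3), Cell("cell-c", 0.6)]
changed = staged_rollout(cells, risky_change)
print(changed)
```

In this sketch the second cell drops below the threshold after the change, so the rollout stops and the third cell is left untouched. A real rollout would rely on observed broker metrics and a soak period between cells rather than a single in-memory value.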