
7 Hard Lessons We Learned About Reliability Here at New Relic

Blog post from New Relic

Post Details
Company: New Relic
Date Published:
Author: Beth Adele Long
Word Count: 1,258
Language: English
Hacker News Points: -
Summary

As New Relic grew, the volume of data it handled and the complexity of its systems grew with it, which made its reliability practices hard to scale. A major incident in October 2014 showed the company that it needed to respond to such events better, and prompted significant changes in its approach to reliability. Over the following fourteen months, New Relic introduced several initiatives, including email distribution lists, a Change Acceptance Board, and automation of certain processes to reduce friction and improve incident management. The company also worked to improve its Mean Time to Resolution (MTTR) by refining processes and involving senior staff in on-call rotations. By defining concrete metrics for reliability and instituting a "Don't Repeat Incidents" policy, New Relic aimed to improve stability and prevent recurring issues. A service maturity model and Site Reliability Engineers embedded in teams further strengthened this approach. Overall, New Relic's journey reflects an iterative process of learning and adaptation in managing the reliability of complex systems, and the company plans to keep evolving its practices as it grows.
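
To make the MTTR metric mentioned in the summary concrete, here is a minimal sketch of how it could be computed from incident records. It is not New Relic's actual tooling: the `incidents` list, its timestamps, and the `mean_time_to_resolution` helper are hypothetical and exist only for illustration. MTTR is taken here as the average of (resolved time minus detected time) across incidents.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
# These timestamps are made up for illustration, not real incident data.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 45)),      # 45 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 16, 0)),   # 120 minutes
    (datetime(2024, 1, 20, 22, 30), datetime(2024, 1, 20, 23, 15)), # 45 minutes
]


def mean_time_to_resolution(records):
    """Return the average time from detection to resolution as a timedelta."""
    if not records:
        raise ValueError("at least one incident is required")
    # Sum the per-incident durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # prints 1:10:00 (70 minutes)
```

A mean is sensitive to a single very long outage, so a team tracking this number might also look at the median or the longest incident; that choice is a judgment call, not something the summary specifies.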