7 Key Reliability Questions for Engineering Managers
Blog post from New Relic
Reliability in software engineering is inherently complex. Teams contend with defects, capacity issues, and operational debt, and these challenges grow with the scale of a platform like New Relic, which operates over 300 services and processes vast amounts of data. To maintain reliability, New Relic follows a set of best practices guided by seven key questions. The questions cover ensuring bulletproof deploys and rollbacks, conducting game days to test incident response, catching regressions before they reach production, keeping risk matrices up to date, ensuring sufficient service capacity, implementing defensive rate limiting, and planning for scalability without major architectural changes. Together, these practices emphasize continual improvement: staying ahead of potential issues is crucial for keeping systems reliable and for avoiding the pitfalls of falling behind as the technology landscape evolves.
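
To make the "defensive rate limiting" item above more concrete, here is a minimal sketch of one common technique, a token bucket. It is an illustration only, not New Relic's implementation: the class name `TokenBucket`, its parameters `rate` and `capacity`, and the injectable `clock` are all assumptions chosen for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter.

    Allows bursts of up to `capacity` requests and refills at `rate`
    tokens per second. The clock is injectable so the limiter can be
    tested without sleeping.
    """

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()         # time of the last refill

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service might keep one such bucket per client or per API key and reject (or queue) a request whenever `allow()` returns `False`, so that a single noisy caller cannot exhaust capacity shared with everyone else. This sketch is not thread-safe; a concurrent service would need a lock around `allow()` or a shared store.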