New Relic One provides real-time insight into infrastructure and application performance by aggregating telemetry data in one place. A recent service interruption, however, showed how automation and redundancy protocols can sometimes make a problem worse rather than contain it. The company uses a cell-based architecture for its Telemetry Data Platform, which provides scale and reliability but adds complexity that has to be managed. Apache Kafka is used extensively to stream and process data, and when a broker becomes unresponsive, the data queued behind it stalls. Infrastructure as code is used to provision resources and manage configuration, but it can get in the way during incident response, when engineers need to act quickly. A combination of human error, inadequate safety mechanisms, and unlucky timing led to the widespread disruption. The company drew several key lessons from the incident: respect cell isolation, make sure emergency tools are safe to use, and continually evaluate incident response processes. New Relic is committed to continuously improving its technology, tooling, and processes to deliver world-class service to its customers.
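
The lessons about respecting cell isolation and keeping emergency tools safe can be illustrated with a small guard around a broker-restart action. This is only a sketch under stated assumptions, not New Relic's actual tooling: the `Broker`, `RestartPlan`, `BlastRadiusError`, and `plan_restart` names, the cell identifiers, and the one-cell default limit are all hypothetical. The idea is that an emergency tool defaults to a dry run and refuses any plan that reaches across more cells than the operator explicitly allowed.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


class BlastRadiusError(Exception):
    """Raised when an action would touch more cells than allowed."""


@dataclass(frozen=True)
class Broker:
    cell: str       # hypothetical cell identifier, e.g. "cell-a"
    broker_id: int  # hypothetical broker id within that cell


@dataclass(frozen=True)
class RestartPlan:
    cells: Tuple[str, ...]
    brokers: Tuple[Broker, ...]
    dry_run: bool


def plan_restart(targets: Iterable[Broker],
                 max_cells: int = 1,
                 confirm: bool = False) -> RestartPlan:
    """Build a restart plan, rejecting plans that cross cell boundaries.

    Assumption: limiting an emergency action to one cell at a time keeps
    a mistake from spreading; the real limit would be a policy decision.
    """
    brokers = tuple(targets)
    cells = tuple(sorted({b.cell for b in brokers}))
    if len(cells) > max_cells:
        raise BlastRadiusError(
            f"plan touches {len(cells)} cells {cells}, limit is {max_cells}"
        )
    # Without explicit confirmation the plan is only a dry run.
    return RestartPlan(cells=cells, brokers=brokers, dry_run=not confirm)


if __name__ == "__main__":
    # A single-cell plan is accepted, but stays a dry run until confirmed.
    print(plan_restart([Broker("cell-a", 1), Broker("cell-a", 2)]))

    # A plan spanning two cells is refused even when confirmed.
    try:
        plan_restart([Broker("cell-a", 1), Broker("cell-b", 1)], confirm=True)
    except BlastRadiusError as err:
        print("refused:", err)
```

Making the safe path the default (dry run, single cell) means an engineer acting under time pressure has to opt in to anything broader, rather than remembering to opt out.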