Caring for container-based services with checks, monitoring, and alerts

Blog post from New Relic

Post Details
Company: New Relic
Author: Jonathan Owens
Word Count: 2,006
Language: English
Hacker News Points: -
Summary

New Relic's Container Fabric (CF) team deploys and maintains containerized services on a platform that manages around 1,000 machines, running on physical hardware across multiple data centers. The team stresses health and readiness checks, monitoring, and alerting as the means of keeping services reliable and observable, which in turn supports customer satisfaction and operator well-being. Its best practices include defining health checks that represent a service's true state, accounting for latency and dependency handling in those checks, and managing cold starts and thundering herds. The CF platform also uses a metrics pipeline to gather container data, which enables key alerts on CPU usage, out-of-memory events, SIGKILLs, and excessive load-balancer connections so that problems are caught before they disrupt service. By fostering a culture of resilience and observability, the CF team aims to keep operators rested and customers happy, on the belief that robust architecture practices directly affect service level agreements and customer satisfaction.
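
The practices named in the summary (health checks that reflect true state, latency and dependency handling, cold starts, thundering herds) can be illustrated with a minimal Python sketch. It is not the CF team's implementation: the class `ServiceChecks`, the function `jittered_backoff`, the 0.5-second latency budget, and the choice to keep dependency probes out of the health check are all assumptions made for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ServiceChecks:
    """Hypothetical health/readiness checks (illustrative names only)."""

    # Each probe returns True when a critical dependency answered correctly.
    critical_probes: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    latency_budget_s: float = 0.5  # assumed per-probe latency budget
    warmed_up: bool = False        # set once caches/connections are primed
    internally_ok: bool = True     # the process's own view of its state

    def health(self) -> bool:
        # Assumption: health covers only the process itself, so a slow
        # dependency does not cause the orchestrator to restart the container.
        return self.internally_ok

    def ready(self) -> bool:
        # A cold container reports "not ready" until it has warmed up.
        if not (self.internally_ok and self.warmed_up):
            return False
        for probe in self.critical_probes.values():
            start = time.monotonic()
            try:
                ok = probe()
            except Exception:
                ok = False
            elapsed = time.monotonic() - start
            # A dependency that fails or answers too slowly blocks readiness.
            if not ok or elapsed > self.latency_budget_s:
                return False
        return True


def jittered_backoff(attempt: int, base_s: float = 0.5,
                     cap_s: float = 30.0, rng=random) -> float:
    """Exponential backoff with full jitter, capped at cap_s seconds.

    Randomizing the delay spreads out retries so that many clients
    recovering at once do not form a thundering herd.
    """
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

In this sketch a failing dependency takes the instance out of rotation through `ready()` while `health()` stays true, so the container is not restarted for a problem it cannot fix; a real service would choose that split based on its own dependencies.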