How Upstash Monitors 25+ Clusters Across the Globe
Blog post from Upstash
Upstash runs its services, including Redis and QStash, on cloud-agnostic infrastructure spanning AWS, GCP, and Fly.io, with a focus on low latency and high availability. The company operates more than 25 Kubernetes clusters around the world, and managing that many clusters depends on a well-integrated observability stack. The stack uses Prometheus for metric collection and alerting, Thanos for long-term metric storage, Humio for centralized logging, and Falco for security monitoring, all integrated to provide real-time insight and support fast incident response. A custom Slack tool, UpstashBot, supports collaboration and incident management, while Teleport provides secure access control during incidents. The architecture is designed to be resource-efficient, secure, and standardized, which lets a small SRE team maintain reliability as the platform scales and helps the company keep its service commitments to customers.
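To make the multi-cluster metrics idea concrete, here is a minimal Python sketch of checking target health across several Prometheus instances. It is not Upstash's implementation: the cluster names, base URLs, and helper functions are hypothetical and chosen only for illustration. The only parts taken from Prometheus itself are the instant-query endpoint (`/api/v1/query`) and the shape of its JSON response. The sketch builds query URLs and parses a response body, leaving the HTTP call to whatever client is in use.

```python
import json
from urllib.parse import urlencode

# Hypothetical cluster names and Prometheus base URLs (assumptions, not from the post).
CLUSTERS = {
    "aws-us-east-1": "http://prometheus.aws-us-east-1.example.internal:9090",
    "gcp-europe-west1": "http://prometheus.gcp-europe-west1.example.internal:9090",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint, /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body: str) -> list[str]:
    """Return the `instance` labels whose `up` sample is 0 in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _, value = series["value"]
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down


# One query URL per cluster; fetching them is left to the caller's HTTP client.
QUERY_URLS = {name: instant_query_url(url, "up") for name, url in CLUSTERS.items()}
```

In a setup like the one described, a global query layer such as Thanos would typically remove the need to fan out to each cluster by hand; the per-cluster loop here only illustrates the underlying query shape.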