
How shuffle sharding in Cortex leads to better scalability and more isolation for Prometheus

Blog post from Grafana Labs

Post Details
Company
Date Published
Author
Tom Wilkie
Word Count
2,118
Language
English
Summary

Cortex, which Grafana Labs helps develop, has evolved to improve scalability and isolation for Prometheus through techniques such as shuffle sharding. Cortex was originally designed to centralize observability and serve multiple tenants from a single, scalable cluster, and as a distributed system it removes the need for a global federation server. Shuffle sharding, inspired by Amazon's techniques, improves tenant isolation by assigning each tenant a random sub-cluster of nodes within the larger cluster, so that a node failure or a misbehaving tenant affects only the tenants whose sub-clusters include the same nodes. Load is still spread across the whole cluster while tenants remain isolated from one another, which matters when tenant sizes vary widely and nodes can fail. As Cortex clusters grow to hundreds of nodes, shuffle sharding has helped minimize outages and limit the impact of problems such as poisoned requests. Grafana Labs has also added features such as query federation and block storage to Cortex, and as of March 2022 has shifted its focus to Grafana Mimir for long-term metric storage.
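
To make the idea of a per-tenant random sub-cluster concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Cortex's actual implementation: the function name `shuffle_shard`, the node names, and the shard size of 4 are all hypothetical, and seeding a random generator from a hash of the tenant ID is just one simple way to make the choice deterministic.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Pick a deterministic pseudo-random subset of nodes for one tenant.

    Hypothetical helper for illustration only; not the Cortex API.
    """
    if shard_size >= len(nodes):
        # The shard cannot be larger than the cluster: use every node.
        return sorted(nodes)
    # Seed a private RNG from a stable hash of the tenant ID so that every
    # caller computes the same shard for the same tenant.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort the input first so the result does not depend on list order.
    return sorted(rng.sample(sorted(nodes), shard_size))


# Assumed example cluster: 50 nodes, 4 nodes per tenant.
nodes = [f"node-{i}" for i in range(50)]
shard_a = shuffle_shard("tenant-a", nodes, 4)
shard_b = shuffle_shard("tenant-b", nodes, 4)

# Nodes that both tenants depend on. With small shards in a large cluster
# this set is usually small, which is what limits the blast radius.
overlap = set(shard_a) & set(shard_b)
print(shard_a, shard_b, overlap)
```

Because each tenant gets its own pseudo-random combination of nodes, two tenants rarely share their entire shard, so a request that takes down one tenant's nodes leaves most other tenants with at least some healthy nodes. A production system would also want shards to stay stable as nodes join or leave the cluster, which this sketch does not attempt.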