How shuffle sharding in Cortex leads to better scalability and more isolation for Prometheus

Post Details

Company

Grafana Labs

Date Published

May 11, 2021

Author

Tom Wilkie

Word Count

2,118

Company Posts That Month

17

Language

English

Hacker News Points

-

Post removed?

No

Source URL

grafana.com/blog/how-shuffle-sharding-in-cortex-leads-to-better-scalability-and-more-isolation-for-prometheus

Summary

Cortex, the horizontally scalable, multi-tenant, Prometheus-compatible system that Grafana Labs helps develop, has evolved to improve scalability and tenant isolation through techniques such as shuffle sharding. Cortex was originally designed to centralize observability by serving multiple tenants from a single, scalable cluster, and its distributed architecture removes the need for a global federation server. Shuffle sharding, inspired by a technique used at Amazon, assigns each tenant a random sub-cluster of nodes within the larger cluster. Because two tenants are unlikely to be assigned exactly the same set of nodes, a node failure or a misbehaving tenant affects only a small part of the cluster, which improves fault tolerance and reduces outage risk. Load is still spread efficiently across the cluster while tenants stay isolated from one another, which matters when tenant sizes vary widely and nodes can fail. As Cortex clusters have grown to hundreds of nodes, shuffle sharding has helped reduce outages and contain the impact of problems such as poisoned requests. Grafana Labs has also added features such as query federation and block storage to Cortex and, as of March 2022, has shifted its focus to Grafana Mimir for long-term metric storage.
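The core idea in the summary, giving each tenant a random sub-cluster of nodes, can be illustrated with a minimal Python sketch. This is not Cortex's implementation: the function name `select_shard`, the node names, and the choice of seeding a random generator from a hash of the tenant ID are all assumptions made for illustration.

```python
import hashlib
import random
from typing import List


def select_shard(tenant_id: str, nodes: List[str], shard_size: int) -> List[str]:
    """Pick a deterministic pseudo-random subset of nodes for one tenant.

    Hypothetical helper for illustration only; not the Cortex algorithm.
    """
    # A non-positive or oversized shard means "use the whole cluster".
    if shard_size <= 0 or shard_size >= len(nodes):
        return sorted(nodes)
    # Derive a stable seed from the tenant ID so that every caller
    # computes the same shard for the same tenant.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return sorted(rng.sample(sorted(nodes), shard_size))


nodes = [f"node-{i}" for i in range(50)]
shard_a = select_shard("tenant-a", nodes, 4)
shard_b = select_shard("tenant-b", nodes, 4)

# Nodes that both tenants depend on; with small shards in a large
# cluster this overlap is usually small, which limits the blast radius.
shared = set(shard_a) & set(shard_b)
print(shard_a, shard_b, sorted(shared))
```

Seeding from the tenant ID keeps the assignment stable without storing any per-tenant state, at the cost of reshuffling shards whenever the node list changes; a production system would need a scheme that stays stable as nodes join and leave.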

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Observability	2	479	132	48	-10%
