How Cloudflare runs Prometheus at scale

What's this blog post about?

Prometheus is a powerful monitoring solution that can handle high-cardinality time series data. However, high cardinality is also a weak point: a large number of metrics or label combinations can overload a Prometheus instance. To tackle this issue, Cloudflare developed two custom patches for Prometheus. The first enforces a total limit on the number of stored time series. The second provides graceful degradation by capping the number of time series per scrape while still allowing appends to existing time series once the limit is reached. Together, these patches help prevent overloaded instances and provide a safety net when dealing with high-cardinality data. Additionally, Cloudflare maintains extensive internal documentation that guides engineers through the entire process of working with metrics in Prometheus, from defining metrics to visualizing them in dashboards.
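
The behaviour of the two limits can be illustrated with a toy model. The following is a minimal sketch in Python, not Cloudflare's patches or Prometheus code (Prometheus itself is written in Go). The class `SeriesStore`, its fields, and the `per_scrape_limit` parameter are hypothetical names chosen for illustration, and the exact way samples are counted against the limit is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SeriesStore:
    """Toy in-memory store (hypothetical); not Prometheus's TSDB."""

    total_limit: int  # assumed cap on the number of stored series
    series: dict[str, list[float]] = field(default_factory=dict)

    def scrape(self, samples: dict[str, float], per_scrape_limit: int) -> list[str]:
        """Ingest one scrape and return the names of series that were dropped."""
        dropped: list[str] = []
        accepted = 0
        for name, value in samples.items():
            if name in self.series:
                # Existing series: always append, even past the per-scrape cap.
                self.series[name].append(value)
                accepted += 1
                continue
            under_scrape_cap = accepted < per_scrape_limit
            under_total_cap = len(self.series) < self.total_limit
            if under_scrape_cap and under_total_cap:
                # New series are created only while both caps allow it.
                self.series[name] = [value]
                accepted += 1
            else:
                # Graceful degradation: drop only the new series,
                # rather than rejecting the whole scrape.
                dropped.append(name)
        return dropped
```

In this sketch, a scrape that exceeds its cap still updates the series that already exist and loses only the series that would have been newly created, which is the graceful-degradation idea described above.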

Company
Cloudflare

Date published
March 3, 2023

Author(s)
Lukasz Mierzwa

Word count
6846

Hacker News points
40

Language
English


By Matt Makai. 2021-2024.