How we use metamonitoring Prometheus servers to monitor all other Prometheus servers at Grafana Labs

Post Details

Company

Grafana Labs

Date Published

April 8, 2021

Author

Jeroen Op 't Eynde

Word Count

1,899

Company Posts That Month

22

Language

English

Hacker News Points

-

Post removed?

No

Source URL

grafana.com/blog/how-we-use-metamonitoring-prometheus-servers-to-monitor-all-other-prometheus-servers-at-grafana-labs

Summary

Grafana Labs runs dedicated metamonitoring Prometheus servers whose job is to monitor its other Prometheus servers, so that failures of the monitoring itself are identified and addressed quickly. The metamonitoring servers are geographically distributed and monitor each other across clusters, with a mechanism similar to a dead man’s switch. The setup includes high-availability (HA) Prometheus pairs within Kubernetes clusters, a global Alertmanager cluster, and Vault for managing authentication and secrets across clusters. Alerts flow from Prometheus to Alertmanager and finally to PagerDuty, and a heartbeat sent to Dead Man’s Snitch ensures that someone is still notified if the alerting chain itself fails. This redundancy lets Grafana Labs maintain observability and keep raising alerts during an outage of any part of the monitoring infrastructure.
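
The dead man's switch idea in the summary can be illustrated with a minimal Python sketch. It is not Grafana Labs' implementation and it does not use the Dead Man's Snitch API. The class name, method names, and the 300-second timeout are all hypothetical, chosen only to show the principle: a heartbeat is expected at regular intervals, and silence is treated as the failure signal.

```python
from dataclasses import dataclass


@dataclass
class DeadMansSwitch:
    """Hypothetical heartbeat watcher: it triggers when heartbeats stop."""

    timeout_seconds: float  # assumed value, not taken from the post
    last_heartbeat: float   # timestamp (seconds) of the latest heartbeat

    def heartbeat(self, now: float) -> None:
        # Record that a heartbeat made it through the whole alerting chain.
        self.last_heartbeat = now

    def is_triggered(self, now: float) -> bool:
        # Silence longer than the timeout means some link in the chain broke.
        return now - self.last_heartbeat > self.timeout_seconds


switch = DeadMansSwitch(timeout_seconds=300.0, last_heartbeat=0.0)
switch.heartbeat(now=60.0)

healthy = switch.is_triggered(now=120.0)  # last heartbeat was 60 s ago
broken = switch.is_triggered(now=600.0)   # last heartbeat was 540 s ago
```

The point of the design is that the watcher sits outside the system it watches: it needs no working Prometheus or Alertmanager to notice that the heartbeats have stopped.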

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Secrets Management	22	1,032	60	34	+176%
Kubernetes	16	881	146	53	-13%
Observability	1	535	120	40	+48%
