Introducing o11y-bench: an open benchmark for AI agents running observability workflows
Blog post from Grafana Labs
o11y-bench is an open-source benchmark that evaluates AI agents on observability workflows in a real Grafana environment: querying metrics, logs, and traces, investigating incidents, and making dashboard changes. Built on the Harbor framework, it provides a standardized test environment that helps users distinguish agents that merely look effective in demos from those that are reliable in real-world scenarios. The benchmark includes 63 tasks spanning Prometheus, Loki, Tempo, and dashboard management. It reports two metrics: Pass^3, which measures consistency across three attempts, and Pass@3, which measures whether an agent succeeds at least once in three attempts. The tasks, environment, and grading logic are all open source, so results are transparent and reproducible, and the community can build on them to advance agent capabilities in observability. In the initial trials, many models succeeded at least once in three attempts, but only a few were consistently reliable, which is why the benchmark emphasizes reliability over occasional success.
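To make the gap between the two metrics concrete, here is a minimal Python sketch. It assumes Pass@3 counts a task as passed when at least one of its three attempts succeeds, and Pass^3 counts it only when all three succeed. The summary above does not spell out the exact formulas, so treat these definitions, the function names, and the task names as illustrative assumptions rather than o11y-bench's actual grading code.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks with at least one successful attempt (assumed Pass@k)."""
    if not trials:
        raise ValueError("no tasks to score")
    return sum(any(attempts) for attempts in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every attempt succeeded (assumed Pass^k)."""
    if not trials:
        raise ValueError("no tasks to score")
    return sum(all(attempts) for attempts in trials.values()) / len(trials)


# Hypothetical results: three attempts per task, True means the attempt passed.
# The task names are invented for illustration and are not real o11y-bench tasks.
results = {
    "prometheus-query": [True, True, True],
    "loki-log-search": [True, False, True],
    "tempo-trace-lookup": [False, False, False],
    "dashboard-edit": [False, True, False],
}

print(f"Pass@3: {pass_at_k(results):.2f}")   # 3 of 4 tasks passed at least once
print(f"Pass^3: {pass_hat_k(results):.2f}")  # 1 of 4 tasks passed every time
```

In this made-up example the same agent scores 0.75 on Pass@3 but only 0.25 on Pass^3, which is the kind of gap between occasional and consistent success that the benchmark is meant to expose.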