Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Post Details

Company

Grafana Labs

Date Published

April 21, 2026

Author

Yasir Ekinci

Word Count

1,662

Company Posts That Month

21

Language

English

Hacker News Points

-

Post removed?

No

Source URL

grafana.com/blog/o11y-bench-open-benchmark-for-observability-agents

Summary

o11y-bench is an open-source benchmark for evaluating AI agents on observability workflows in a real Grafana environment. Its tasks cover querying metrics, logs, and traces, investigating incidents, and making dashboard changes. Built on the Harbor framework, it provides a standardized test environment that helps users distinguish agents that merely look effective in demos from those that are reliable in real-world scenarios. The benchmark includes 63 tasks across observability domains such as Prometheus, Loki, Tempo, and dashboard management, and reports Pass^3 and Pass@3 to measure consistency and success rates, respectively. By open-sourcing the tasks, environment, and grading logic, o11y-bench aims to be transparent and reproducible and to encourage community work on agent capabilities in observability. In the initial trials, many models succeeded at least once in three attempts, but only a few were consistently reliable, which underscores the importance of measuring reliability rather than occasional success in observability tasks.
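To make the two metrics concrete, here is a minimal sketch of how Pass@3 and Pass^3 could be computed from three trials per task. It assumes the common reading of these metrics: Pass@3 counts a task as solved if at least one of the three attempts succeeds, while Pass^3 requires all three to succeed. The task names, the sample results, and the function names are hypothetical and are not taken from o11y-bench; the benchmark's actual grading logic may differ.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in at least one recorded trial."""
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in every recorded trial."""
    return sum(all(outcomes) for outcomes in trials.values()) / len(trials)


# Hypothetical outcomes: three trials per task, True means the trial passed.
results = {
    "prometheus-query": [True, True, True],
    "loki-search": [True, False, True],
    "tempo-trace": [False, False, False],
    "dashboard-edit": [True, True, False],
}

print(f"Pass@3 = {pass_at_k(results):.2f}")
print(f"Pass^3 = {pass_hat_k(results):.2f}")
```

In this made-up sample, three of four tasks pass at least once but only one passes every time, which is the kind of gap between occasional success and consistent reliability that the summary describes.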

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Observability	33	4,496	812	176	+40%
AI Agents	4	4,430	1,100	236	-3%
LLM	1	5,932	1,046	223	-2%
MCP	1	6,108	613	170	+36%
