How we built a real-world evaluation platform for autonomous SRE agents at scale
Blog post from Datadog
Datadog's Bits AI SRE team built a comprehensive evaluation platform so they could both improve their autonomous agent, Bits, and trust its output. Bits investigates production incidents by analyzing diverse data sources such as metrics, logs, and network telemetry. Early on, each feature added to Bits caused unforeseen regressions elsewhere, which made a robust evaluation system essential.

In response, the team built a replayable evaluation framework driven by curated labels that capture real-world incident scenarios, letting them measure and refine Bits' performance systematically. The platform segments and scores investigations, tracks changes over time, and catches cases where a new feature or model quietly degrades performance. Over time, the system evolved from manual labeling to using Bits itself to create and validate labels, raising both the rate and the quality of label production.

By embedding real-world noise into evaluations and automating most of the process, the team can now run extensive, realistic tests that prevent regressions and guide development. The same infrastructure also helps other Datadog teams refine their agents, supporting continuous improvement through systematic, large-scale evaluation.
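To make the replay-and-score loop concrete, here is a minimal Python sketch of how such a framework might work: each curated label pairs recorded incident inputs with expected findings, a candidate agent build is replayed against the full label set, and per-scenario scores are compared against a baseline to flag regressions. All names (`Label`, `run_suite`, `regressions`) and the recall-style scoring are illustrative assumptions, not Datadog's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Label:
    """A curated label: a recorded incident scenario plus the expected findings."""
    scenario_id: str
    recorded_inputs: dict          # replayed metrics/logs/telemetry snapshots
    expected_findings: set[str]    # ground-truth conclusions the agent should reach

@dataclass
class EvalResult:
    scenario_id: str
    score: float                   # fraction of expected findings recovered

def score_investigation(findings: set[str], label: Label) -> float:
    """Score one replayed investigation as recall over the expected findings."""
    if not label.expected_findings:
        return 1.0
    return len(findings & label.expected_findings) / len(label.expected_findings)

def run_suite(agent, labels: list[Label]) -> list[EvalResult]:
    """Replay every labeled scenario through the agent and score each run."""
    results = []
    for label in labels:
        findings = agent(label.recorded_inputs)  # agent returns a set of findings
        results.append(EvalResult(label.scenario_id, score_investigation(findings, label)))
    return results

def regressions(baseline: list[EvalResult], candidate: list[EvalResult],
                tol: float = 0.05) -> list[EvalResult]:
    """Flag scenarios where the candidate scores meaningfully worse than baseline."""
    base = {r.scenario_id: r.score for r in baseline}
    return [r for r in candidate if r.score < base.get(r.scenario_id, 0.0) - tol]

# Hypothetical usage: replay the label set through a candidate build and
# compare against stored baseline results before shipping the change.
# flagged = regressions(baseline_results, run_suite(candidate_agent, curated_labels))
```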