How we built a real-world evaluation platform for autonomous SRE agents at scale
Blog post from Datadog
Datadog's Bits AI SRE team built a comprehensive evaluation platform so they could both improve their autonomous agent, Bits, and trust its output. Bits investigates production incidents by analyzing diverse data sources such as metrics, logs, and network telemetry. Early on, each feature added to Bits caused unforeseen regressions elsewhere, which made a robust evaluation system essential.

In response, the team built a replayable evaluation framework driven by curated labels that capture real-world incident scenarios, letting them measure and refine Bits' performance systematically. The platform segments and scores investigations, tracks changes over time, and catches cases where a new feature or model quietly degrades performance. Over time, the system evolved from manual labeling to using Bits itself to create and validate labels, raising both the rate and the quality of label production.

By embedding real-world noise into evaluations and automating most of the process, the team can now run extensive, realistic tests that prevent regressions and guide development. The same infrastructure also helps other Datadog teams refine their agents, supporting continuous improvement through systematic, large-scale evaluation.
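To make the replay-and-score loop concrete, here is a minimal Python sketch of how such a framework might work: each curated label pairs recorded incident inputs with expected findings, a candidate agent build is replayed against the full label set, and per-scenario scores are compared against a baseline to flag regressions. All names (`Label`, `run_suite`, `regressions`) and the recall-style scoring are illustrative assumptions, not Datadog's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Label:
    """A curated label: a recorded incident scenario plus the expected findings."""
    scenario_id: str
    recorded_inputs: dict          # replayed metrics/logs/telemetry snapshots
    expected_findings: set[str]    # ground-truth conclusions the agent should reach

@dataclass
class EvalResult:
    scenario_id: str
    score: float                   # fraction of expected findings recovered

def score_investigation(findings: set[str], label: Label) -> float:
    """Score one replayed investigation as recall over the expected findings."""
    if not label.expected_findings:
        return 1.0
    return len(findings & label.expected_findings) / len(label.expected_findings)

def run_suite(agent, labels: list[Label]) -> list[EvalResult]:
    """Replay every labeled scenario through the agent and score each run."""
    results = []
    for label in labels:
        findings = agent(label.recorded_inputs)  # agent returns a set of findings
        results.append(EvalResult(label.scenario_id, score_investigation(findings, label)))
    return results

def regressions(baseline: list[EvalResult], candidate: list[EvalResult],
                tol: float = 0.05) -> list[EvalResult]:
    """Flag scenarios where the candidate scores meaningfully worse than baseline."""
    base = {r.scenario_id: r.score for r in baseline}
    return [r for r in candidate if r.score < base.get(r.scenario_id, 0.0) - tol]

# Hypothetical usage: replay the label set through a candidate build and
# compare against stored baseline results before shipping the change.
# flagged = regressions(baseline_results, run_suite(candidate_agent, curated_labels))
```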