Content Deep Dive

How to scale agentic evaluation: Lessons from 200,000 SWE-bench runs

Blog post from AI21 Labs

Post Details
Company: AI21 Labs
Date Published
Author: Yaron Sternbach, VP Engineering
Word Count: 1,384
Language: English
Hacker News Points: -
Summary

Evaluating agentic benchmarks at scale posed significant infrastructure challenges for AI21 Labs: running AI21 Maestro against SWE-bench Verified required orchestrating over 200,000 evaluations to reach statistical confidence. Traditional evaluation pipelines are designed for short, linear processes and struggle with the stateful, branching nature of agentic systems.

The key technical obstacles were managing high-latency, multi-step workflows; isolating execution environments to prevent state collisions between runs; and building an architecture resilient to inevitable infrastructure failures. Initial attempts, both local and on a naive Kubernetes setup, ran into resource contention and rate limits, in part because the SWE-bench code assumes a local execution environment.

The breakthrough came from adopting a multi-tenant simulation environment that shares resources across runs, which dramatically reduced failure rates and supported up to 8,000 parallel runs. This setup was further optimized by separating the generation step from the evaluation step, enabling runs to be resumed and analyzed efficiently even when some of them fail.

Scaling the environment this way not only delivered statistical confidence but also surfaced the system-level efficiencies required for real-world enterprise AI deployments, underscoring the value of large-scale, iterative testing.
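The generation/evaluation split described above can be sketched as a resumable two-phase runner: each phase persists per-instance results to disk, so a crashed or interrupted run skips work that is already complete. This is a minimal illustration of the pattern, not AI21's implementation; `run_agent` and `evaluate_patch` are hypothetical stand-ins for the real agentic and test-execution steps.

```python
import json
from pathlib import Path


def run_agent(instance_id: str) -> str:
    """Hypothetical stand-in for the agentic generation step."""
    return f"patch-for-{instance_id}"


def evaluate_patch(instance_id: str, patch: str) -> bool:
    """Hypothetical stand-in for running the instance's tests on a patch."""
    return patch.endswith(instance_id)


def generate(instances: list[str], out_dir: Path) -> None:
    """Phase 1: produce one prediction file per instance, skipping done work."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for iid in instances:
        target = out_dir / f"{iid}.json"
        if target.exists():  # resumability: this instance already finished
            continue
        patch = run_agent(iid)
        target.write_text(json.dumps({"instance_id": iid, "patch": patch}))


def evaluate(out_dir: Path, results_path: Path) -> dict:
    """Phase 2: score predictions, checkpointing after every instance."""
    results = (
        json.loads(results_path.read_text()) if results_path.exists() else {}
    )
    for pred_file in sorted(out_dir.glob("*.json")):
        pred = json.loads(pred_file.read_text())
        iid = pred["instance_id"]
        if iid in results:  # already evaluated on a previous run
            continue
        results[iid] = evaluate_patch(iid, pred["patch"])
        results_path.write_text(json.dumps(results))  # checkpoint
    return results
```

Because each phase is idempotent over its output directory, a failed batch can simply be relaunched: completed instances are skipped, and partial results remain available for analysis while the rest finish.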