How to scale agentic evaluation: Lessons from 200,000 SWE-bench runs
Blog post from AI21 Labs
Evaluating agentic systems on SWE-bench Verified with AI21 Maestro required solving significant infrastructure problems: reaching statistical confidence at this complexity and scale meant orchestrating more than 200,000 evaluations. Traditional evaluation harnesses, designed for short, linear processes, break down on the stateful, branching nature of agentic systems.

The key technical obstacles were managing high-latency, multi-step workflows; providing isolated execution environments so concurrent runs could not collide on shared state; and building an architecture resilient to the infrastructure failures that are inevitable at this scale.

Initial attempts, first local and then a naive Kubernetes deployment, ran into resource contention and rate limits, largely because the SWE-bench code assumes a local execution environment.

The breakthrough came from adopting a multi-tenant simulation environment that shares resources across runs, which dramatically reduced failure rates and supported up to 8,000 parallel runs. This setup was further optimized by separating the generation step from the evaluation step, which made runs resumable and allowed efficient analysis even when some runs fail.

Scaling the environment this far did more than enable statistical confidence: it surfaced the system efficiencies that real-world enterprise AI deployments need, underscoring the value of large-scale, iterative testing.
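The orchestration problem described above, many high-latency runs, bounded parallelism, and retries on transient infrastructure failure, can be sketched with `asyncio`. This is a minimal illustration, not AI21's implementation; the concurrency cap, instance IDs, and the placeholder agent call are all assumptions (the post reports up to 8,000 parallel runs in practice).

```python
import asyncio
import random

MAX_PARALLEL = 100  # hypothetical cap; the post reports up to 8,000 in production


async def run_instance(instance_id: str, sem: asyncio.Semaphore, retries: int = 3) -> dict:
    """Run one agent evaluation under a concurrency limit, retrying transient failures."""
    async with sem:
        for attempt in range(1, retries + 1):
            try:
                # Placeholder for the real agent rollout + evaluation call.
                await asyncio.sleep(0)
                return {"instance": instance_id, "status": "done", "attempt": attempt}
            except OSError:  # stand-in for transient infra errors (timeouts, rate limits)
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
        return {"instance": instance_id, "status": "failed"}


async def main() -> list[dict]:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    tasks = [run_instance(f"instance-{i}", sem) for i in range(10)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
```

The semaphore keeps the number of in-flight runs bounded regardless of how many tasks are scheduled, and per-instance retries keep one flaky run from poisoning the batch.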
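The resumability point, separating generation from evaluation so completed work survives a crash, can be sketched as an append-only checkpoint of finished instances. The file name, record schema, and helper names here are illustrative assumptions, not the post's actual design.

```python
import json
from pathlib import Path

STATE = Path("eval_state.jsonl")  # hypothetical append-only checkpoint file


def load_completed(state_path: Path = STATE) -> set[str]:
    """Instances whose evaluation already finished; safe to skip on restart."""
    if not state_path.exists():
        return set()
    return {
        json.loads(line)["instance"]
        for line in state_path.read_text().splitlines()
        if line
    }


def record(instance: str, resolved: bool, state_path: Path = STATE) -> None:
    """Append one result; a crash mid-batch loses at most the in-flight instance."""
    with state_path.open("a") as f:
        f.write(json.dumps({"instance": instance, "resolved": resolved}) + "\n")


def pending(all_instances: list[str], state_path: Path = STATE) -> list[str]:
    """Instances still needing evaluation after a restart."""
    done = load_completed(state_path)
    return [i for i in all_instances if i not in done]
```

Because results are appended as each run finishes, a restarted job simply filters its work list through `pending()` instead of re-running everything, which is what makes partial failures cheap at 200,000-run scale.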