How to scale agentic evaluation: Lessons from 200,000 SWE-bench runs
Blog post from AI21 Labs
Evaluating agentic systems on SWE-bench Verified with AI21 Maestro required solving significant infrastructure problems: reaching statistical confidence at this complexity and scale meant orchestrating more than 200,000 evaluations. Traditional evaluation harnesses, designed for short, linear processes, break down on the stateful, branching nature of agentic systems.

The key technical obstacles were managing high-latency, multi-step workflows; providing isolated execution environments so concurrent runs could not collide on shared state; and building an architecture resilient to the infrastructure failures that are inevitable at this scale.

Initial attempts, first local and then a naive Kubernetes deployment, ran into resource contention and rate limits, largely because the SWE-bench code assumes a local execution environment.

The breakthrough came from adopting a multi-tenant simulation environment that shares resources across runs, which dramatically reduced failure rates and supported up to 8,000 parallel runs. This setup was further optimized by separating the generation step from the evaluation step, which made runs resumable and allowed efficient analysis even when some runs fail.

Scaling the environment this far did more than enable statistical confidence: it surfaced the system efficiencies that real-world enterprise AI deployments need, underscoring the value of large-scale, iterative testing.
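The orchestration problem described above, many high-latency runs, bounded parallelism, and retries on transient infrastructure failure, can be sketched with `asyncio`. This is a minimal illustration, not AI21's implementation; the concurrency cap, instance IDs, and the placeholder agent call are all assumptions (the post reports up to 8,000 parallel runs in practice).

```python
import asyncio
import random

MAX_PARALLEL = 100  # hypothetical cap; the post reports up to 8,000 in production


async def run_instance(instance_id: str, sem: asyncio.Semaphore, retries: int = 3) -> dict:
    """Run one agent evaluation under a concurrency limit, retrying transient failures."""
    async with sem:
        for attempt in range(1, retries + 1):
            try:
                # Placeholder for the real agent rollout + evaluation call.
                await asyncio.sleep(0)
                return {"instance": instance_id, "status": "done", "attempt": attempt}
            except OSError:  # stand-in for transient infra errors (timeouts, rate limits)
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
        return {"instance": instance_id, "status": "failed"}


async def main() -> list[dict]:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    tasks = [run_instance(f"instance-{i}", sem) for i in range(10)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
```

The semaphore keeps the number of in-flight runs bounded regardless of how many tasks are scheduled, and per-instance retries keep one flaky run from poisoning the batch.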
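The resumability point, separating generation from evaluation so completed work survives a crash, can be sketched as an append-only checkpoint of finished instances. The file name, record schema, and helper names here are illustrative assumptions, not the post's actual design.

```python
import json
from pathlib import Path

STATE = Path("eval_state.jsonl")  # hypothetical append-only checkpoint file


def load_completed(state_path: Path = STATE) -> set[str]:
    """Instances whose evaluation already finished; safe to skip on restart."""
    if not state_path.exists():
        return set()
    return {
        json.loads(line)["instance"]
        for line in state_path.read_text().splitlines()
        if line
    }


def record(instance: str, resolved: bool, state_path: Path = STATE) -> None:
    """Append one result; a crash mid-batch loses at most the in-flight instance."""
    with state_path.open("a") as f:
        f.write(json.dumps({"instance": instance, "resolved": resolved}) + "\n")


def pending(all_instances: list[str], state_path: Path = STATE) -> list[str]:
    """Instances still needing evaluation after a restart."""
    done = load_completed(state_path)
    return [i for i in all_instances if i not in done]
```

Because results are appended as each run finishes, a restarted job simply filters its work list through `pending()` instead of re-running everything, which is what makes partial failures cheap at 200,000-run scale.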