CodeScaleBench: Testing coding agents on large codebases and multi-repo software engineering tasks
Blog post from Sourcegraph
CodeScaleBench is a new benchmark for evaluating coding agents on the complexities of enterprise software development. Existing benchmarks often fail to measure how well agents handle large, multi-repository codebases spanning multiple programming languages, and CodeScaleBench is designed to address that gap. It includes 370 tasks in two parts: CodeScaleBench-SDLC, which assesses agents across the full software development lifecycle, and CodeScaleBench-Org, which focuses on organization-level tasks.
Initial findings indicate that agents using Sourcegraph MCP tools outperform baseline configurations on tasks that require extensive codebase navigation and context retrieval, particularly in cross-repository scenarios. The benchmark emphasizes robust quality assurance to keep results valid and reliable, and it highlights the need for comprehensive tooling and retrieval strategies in enterprise-scale software development. Although context retrieval metrics improve, getting agents to use advanced search tools effectively remains a challenge, because they often default to keyword searches.
Ongoing development of CodeScaleBench aims to refine the evaluation framework, expand the range of tasks, and explore different agent harnesses and MCP tool combinations to improve the assessment of coding agents in complex environments.