CodeScaleBench: Testing coding agents on large codebases and multi-repo software engineering tasks

Post Details

Company

Sourcegraph

Date Published

March 3, 2026

Author

Stephanie Jarmak

Word Count

4,260

Company Posts That Month

4

Language

English

Hacker News Points

-

Post removed?

No

Source URL

sourcegraph.com/blog/codescalebench-testing-coding-agents-on-large-codebases-and-multi-repo-software-engineering-tasks

Summary

CodeScaleBench is a new benchmark for evaluating coding agents against the complexities of enterprise software development. It addresses a limitation of existing benchmarks, which often fail to assess how well agents handle large, multi-repository codebases spanning multiple programming languages. The benchmark comprises 370 tasks in two parts: CodeScaleBench-SDLC, which assesses agents across the full software development lifecycle, and CodeScaleBench-Org, which focuses on organization-level tasks. Initial findings indicate that agents using Sourcegraph MCP tools outperform baseline configurations on tasks requiring extensive codebase navigation and context retrieval, particularly in cross-repository scenarios. The post emphasizes that robust quality assurance is needed for valid, reliable results, and that enterprise-scale software development calls for comprehensive tooling and retrieval strategies. Although context retrieval metrics improved, agents still often default to keyword searches rather than using the more advanced search tools available to them. Ongoing development of CodeScaleBench aims to refine the evaluation framework, expand the range of tasks, and explore different agent harnesses and MCP tool combinations.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
MCP	52	4,488	443	150	+34%
Kubernetes	4	1,840	308	106	+33%
Observability	3	3,204	716	172	+14%
AI Coding Assistant	1	1,255	319	126	+24%
