DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking
Blog post from Qodo
Qodo has developed a new benchmark dataset of real-world questions drawn from complex code repositories to support research and development in code retrieval systems. The dataset addresses a gap left by existing benchmarks, which often rely on artificially generated code snippets or focus on retrieval from databases rather than from code repositories.

The questions were extracted from pull requests (PRs), which are rich sources of complex, interconnected code changes; large language models (LLMs) were then used to generate realistic developer questions and answers from them. Evaluation uses a method called "fact recall," which objectively scores a model's prediction by checking whether each discrete fact from the ground-truth answer is present in the predicted answer.

On this metric, Qodo's Deep Research agent outperformed alternatives such as OpenAI's Codex and Anthropic's Claude, demonstrating both speed and accuracy in retrieving code-related information. The release includes 1,144 question-answer pairs, along with metadata, context, and the prompts used to create the dataset, with the aim of improving AI-assisted code navigation and comprehension tools.
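To make the "fact recall" idea concrete, here is a minimal sketch of the metric: the ground-truth answer is assumed to already be broken into discrete facts (in the real pipeline an LLM performs that decomposition), and each fact is checked against the predicted answer. The `fact_supported` judge below is a naive keyword-overlap stand-in for the LLM-based verification the post describes; the function names, threshold, and example data are illustrative, not part of the released benchmark.

```python
import re

def fact_supported(fact: str, prediction: str, threshold: float = 0.6) -> bool:
    """Naive stand-in for an LLM judge: a fact counts as recalled if most
    of its content words (length > 3) appear in the predicted answer."""
    pred_words = set(re.findall(r"[a-z]+", prediction.lower()))
    fact_words = [w for w in re.findall(r"[a-z]+", fact.lower()) if len(w) > 3]
    if not fact_words:
        return False
    hits = sum(w in pred_words for w in fact_words)
    return hits / len(fact_words) >= threshold

def fact_recall(ground_truth_facts: list[str], prediction: str) -> float:
    """Fraction of ground-truth facts found in the predicted answer."""
    if not ground_truth_facts:
        return 0.0
    recalled = sum(fact_supported(f, prediction) for f in ground_truth_facts)
    return recalled / len(ground_truth_facts)

# Hypothetical example: two facts extracted from a ground-truth answer.
facts = [
    "the cache is invalidated on every write",
    "eviction uses an LRU policy",
]
prediction = "Every write leaves the cache invalidated, and eviction uses an LRU policy."
score = fact_recall(facts, prediction)  # both facts recalled -> 1.0
```

In the actual evaluation an LLM would judge whether a fact is semantically entailed by the prediction rather than relying on word overlap, but the aggregation step (recalled facts divided by total facts) is the same.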