
How we built a real-world benchmark for AI code review

Blog post from Qodo

Post Details
Company: Qodo
Date Published:
Author: Tomer Yanay
Word Count: 1,696
Language: English
Hacker News Points: -
Summary

Qodo's research team has developed Qodo's Code Review Benchmark 1.0, a benchmark designed to rigorously evaluate AI-powered code review systems on both code correctness and code quality in realistic scenarios. Existing benchmarks typically emphasize bug detection by backtracking from fixes to the buggy commits that introduced them, which limits both their scale and the context they provide. Qodo addresses these limitations by injecting defects into genuine, merged pull requests from active open-source repositories, enabling a broader evaluation of AI tools.

In a comparative evaluation against seven leading platforms, Qodo led in defect detection with an F1 score of 60.1%. The benchmark, publicly available on GitHub, measures not only bug detection but also compliance with best practices, offering a robust standard for AI code review evaluation. The methodology injects complex defects into diverse, production-grade repositories and scores tools on precision and recall; Qodo stood out by maintaining high recall without generating excessive noise. This approach provides a practical, scalable framework for evaluating AI code review tools across programming languages and system disciplines.
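The scoring described above, where tools are graded on precision (flagged findings that are real injected defects) and recall (injected defects actually caught), combined into an F1 score, can be sketched as follows. This is an illustrative sketch only, not Qodo's actual evaluation harness; all function names, defect IDs, and numbers below are hypothetical.

```python
def score(true_defects: set, flagged: set) -> tuple:
    """Compute precision, recall, and F1 from sets of defect identifiers.

    true_defects: IDs of defects injected into the pull request.
    flagged: IDs the review tool reported (may include spurious findings).
    """
    tp = len(true_defects & flagged)   # injected defects the tool caught
    fp = len(flagged - true_defects)   # spurious findings (noise)
    fn = len(true_defects - flagged)   # injected defects the tool missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical run: 10 injected defects; the tool reports 12 findings,
# 7 of which match real injected defects.
truth = {f"defect-{i}" for i in range(10)}
flags = {f"defect-{i}" for i in range(7)} | {f"noise-{i}" for i in range(5)}
p, r, f1 = score(truth, flags)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The F1 score penalizes both missed defects and noisy over-flagging, which is why a tool can only reach a high F1 by keeping recall up without drowning reviewers in false positives.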