
How we built a real-world benchmark for AI code review

Blog post from Qodo

Post Details
Company: Qodo
Date Published:
Author: Tomer Yanay
Word Count: 1,696
Language: English
Hacker News Points: -
Summary

Qodo's research team has developed Qodo's Code Review Benchmark 1.0, a benchmark designed to rigorously evaluate AI-powered code review systems on both code correctness and code quality in realistic scenarios. Existing benchmarks typically emphasize bug detection by backtracking from fixes to the buggy commits that introduced them, which limits both their scale and the context they provide. Qodo addresses these limitations by injecting defects into genuine, merged pull requests from active open-source repositories, enabling a broader evaluation of AI tools.

In a comparative evaluation against seven leading platforms, Qodo led in defect detection with an F1 score of 60.1%. The benchmark, publicly available on GitHub, measures not only bug detection but also compliance with best practices, offering a robust standard for AI code review evaluation. The methodology injects complex defects into diverse, production-grade repositories and scores tools on precision and recall; Qodo stood out by maintaining high recall without generating excessive noise. This approach provides a practical, scalable framework for evaluating AI code review tools across programming languages and system disciplines.
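The scoring described above, where tools are graded on precision (flagged findings that are real injected defects) and recall (injected defects actually caught), combined into an F1 score, can be sketched as follows. This is an illustrative sketch only, not Qodo's actual evaluation harness; all function names, defect IDs, and numbers below are hypothetical.

```python
def score(true_defects: set, flagged: set) -> tuple:
    """Compute precision, recall, and F1 from sets of defect identifiers.

    true_defects: IDs of defects injected into the pull request.
    flagged: IDs the review tool reported (may include spurious findings).
    """
    tp = len(true_defects & flagged)   # injected defects the tool caught
    fp = len(flagged - true_defects)   # spurious findings (noise)
    fn = len(true_defects - flagged)   # injected defects the tool missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical run: 10 injected defects; the tool reports 12 findings,
# 7 of which match real injected defects.
truth = {f"defect-{i}" for i in range(10)}
flags = {f"defect-{i}" for i in range(7)} | {f"noise-{i}" for i in range(5)}
p, r, f1 = score(truth, flags)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The F1 score penalizes both missed defects and noisy over-flagging, which is why a tool can only reach a high F1 by keeping recall up without drowning reviewers in false positives.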