
Which Model Reviews Code Best?

Blog post from Factory

Post Details
Company: Factory
Author: Factory Research, Nizar Alrifai
Word Count: 1,568
Language: English
Summary

A benchmark study evaluated 13 models for automated code review across 50 pull requests from five major open-source projects, asking which model identifies real bugs most cost-effectively. The most expensive models were not necessarily the best performers: models such as MiniMax M2.7 and Kimi K2.5 delivered strong results at a fraction of the cost. GPT-5.2 emerged as the best overall model, providing top-tier quality at $1.25 per pull request with a 60.5% F1 score. The study found that model architecture and training matter more for review quality than price, with open-source models like Kimi K2.5 and GLM-5.1 remaining competitive against more expensive options. The benchmark is open source and guides model selection for Droid's code review; ongoing evaluations aim to refine the methodology and explore cost-effective strategies such as multi-pass reviews with cheaper models.
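The summary cites an F1 score as the quality metric. As a minimal sketch (not Factory's actual harness), per-pull-request F1 could be computed from the set of ground-truth bugs and the set of issues a model flags; the function name and set-based representation here are assumptions for illustration:

```python
def review_f1(true_bugs: set, flagged: set) -> float:
    """F1 over flagged findings: harmonic mean of precision
    (share of flagged items that are real bugs) and recall
    (share of real bugs that were flagged)."""
    tp = len(true_bugs & flagged)                     # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true_bugs) if true_bugs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example: 4 real bugs, model flags 3 findings,
# 2 of which are real -> precision 2/3, recall 1/2, F1 = 4/7.
score = review_f1({"bug-1", "bug-2", "bug-3", "bug-4"},
                  {"bug-1", "bug-2", "false-alarm"})
```

Averaging this score across the 50 benchmark pull requests, then weighing it against per-PR cost, matches the kind of quality-versus-price comparison the study describes.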