
Which Model Reviews Code Best?

Blog post from Factory

Post Details
Company: Factory
Author: Factory Research, Nizar Alrifai
Word Count: 1,568
Language: English
Summary

A benchmark study evaluated 13 models for automated code review across 50 pull requests from five major open-source projects, asking which model identifies real bugs most cost-effectively. The most expensive models were not necessarily the best performers: models such as MiniMax M2.7 and Kimi K2.5 delivered strong results at a fraction of the cost. GPT-5.2 emerged as the best overall model, providing top-tier quality at $1.25 per pull request with a 60.5% F1 score. The study found that model architecture and training matter more for review quality than price, with open-source models like Kimi K2.5 and GLM-5.1 remaining competitive against more expensive options. The benchmark is open source and guides model selection for Droid's code review; ongoing evaluations aim to refine the methodology and explore cost-effective strategies such as multi-pass reviews with cheaper models.
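The summary cites an F1 score as the quality metric. As a minimal sketch (not Factory's actual harness), per-pull-request F1 could be computed from the set of ground-truth bugs and the set of issues a model flags; the function name and set-based representation here are assumptions for illustration:

```python
def review_f1(true_bugs: set, flagged: set) -> float:
    """F1 over flagged findings: harmonic mean of precision
    (share of flagged items that are real bugs) and recall
    (share of real bugs that were flagged)."""
    tp = len(true_bugs & flagged)                     # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true_bugs) if true_bugs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example: 4 real bugs, model flags 3 findings,
# 2 of which are real -> precision 2/3, recall 1/2, F1 = 4/7.
score = review_f1({"bug-1", "bug-2", "bug-3", "bug-4"},
                  {"bug-1", "bug-2", "false-alarm"})
```

Averaging this score across the 50 benchmark pull requests, then weighing it against per-PR cost, matches the kind of quality-versus-price comparison the study describes.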