We benchmarked GPT-4.1: Here’s what we found
Blog post from Qodo
In a comparative study to determine which AI model provides the most effective code suggestions for pull requests, GPT-4.1 outperformed Claude 3.7 Sonnet in 54.9% of 200 tested scenarios. An AI judge model assessed the usefulness and accuracy of each suggestion, giving GPT-4.1 a slightly higher average score of 6.81 out of 10 versus 6.66 for Claude 3.7 Sonnet.

GPT-4.1 maintained a better signal-to-noise ratio: it accurately identified real issues, adhered closely to task requirements, avoided suggesting unnecessary changes, and focused on critical bugs. This was particularly evident in tasks involving Dockerfile modifications, JSON parsing, and dependency management. The study highlights GPT-4.1's practical value in real-world development workflows, where it effectively balances silence with thoroughness.

As a result, GPT-4.1 is now integrated into Qodo Gen, an IDE plugin for coding and testing available for VS Code and JetBrains, offering developers enhanced tools for code review and generation.
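To make the evaluation setup concrete, here is a minimal sketch of how pairwise judge scores could be aggregated into the win rate and average scores reported above. The data structure, function name, and tie handling are illustrative assumptions, not Qodo's actual benchmark code.

```python
# Hypothetical aggregation of AI-judge scores for two models.
# Each judged scenario yields a (gpt41_score, claude_score) pair on a 0-10 scale.
# This is an illustrative sketch, not Qodo's implementation.

def summarize(judgments):
    """Compute GPT-4.1's win rate (strict wins only) and per-model averages."""
    n = len(judgments)
    gpt_wins = sum(1 for g, c in judgments if g > c)  # ties count as non-wins here
    return {
        "gpt41_win_rate": gpt_wins / n,
        "gpt41_avg": sum(g for g, _ in judgments) / n,
        "claude_avg": sum(c for _, c in judgments) / n,
    }

# Toy example with three judged scenarios
print(summarize([(7, 6), (6, 8), (8, 5)]))
```

Over the real study's 200 scenarios, the same aggregation would yield the reported 54.9% win rate and the 6.81 vs. 6.66 averages.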