New data on code quality: GPT-5.2 high, Opus 4.5, Gemini 3, and more
Blog post from Sonar
The Sonar LLM Leaderboard evaluates AI coding models by running more than 4,000 Java programming assignments through the SonarQube static analysis engine, measuring functional performance, structural quality, security, and maintainability.

The analysis found that while models such as Opus 4.5 Thinking and Gemini 3 Pro achieved high pass rates, they differed sharply in verbosity and complexity, which affects their maintainability and ease of use. GPT-5.2 High led on security, with the lowest rate of blocker vulnerabilities per million lines of code, but struggled with high code volume and concurrency issues. Claude Sonnet 4.5, by contrast, exhibited the highest rate of critical security vulnerabilities and resource-management leaks.

The research highlights the trade-offs between performance and complexity: Gemini 3 Pro, for example, pairs a high pass rate with low verbosity and cognitive complexity, albeit at a higher issue density. The leaderboard aims to give engineering leaders transparency into how AI models handle essential software engineering fundamentals, which ultimately affects total cost of ownership through factors such as code smells and violations of design best practices.
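To make the security metric concrete, here is a minimal sketch of how a normalized rate like "blocker vulnerabilities per million lines of code" can be computed. The class and method names, and the example figures, are illustrative assumptions, not Sonar's actual API or data:

```java
// Hypothetical sketch: normalizing a raw issue count to issues per million
// lines of code (MLOC), the kind of unit behind metrics such as
// "blocker vulnerabilities per MLOC". Names and numbers are illustrative.
public class IssueDensity {

    /** Issues per million lines of code: count * 1,000,000 / LOC. */
    static double perMillionLoc(long issueCount, long linesOfCode) {
        if (linesOfCode <= 0) {
            throw new IllegalArgumentException("linesOfCode must be positive");
        }
        return issueCount * 1_000_000.0 / linesOfCode;
    }

    public static void main(String[] args) {
        // Made-up example: 12 blocker vulnerabilities across 4,800,000
        // generated lines of code.
        double density = perMillionLoc(12, 4_800_000);
        System.out.printf("%.1f blocker vulnerabilities per MLOC%n", density);
    }
}
```

Normalizing by code volume matters here because, as the post notes, the models differ widely in verbosity: a verbose model can rack up more raw issues simply by emitting more code, while a per-MLOC rate compares models on equal footing.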