Announcing SonarSweep: Improving training data quality for coding LLMs
Blog post from Sonar
AI-assisted coding offers significant potential, but the quality of code generated by Large Language Models (LLMs) depends heavily on the quality of their training data. Research from Anthropic and Sonar has shown that poor-quality or malicious data can introduce severe security vulnerabilities and bugs into generated code, illustrating the "garbage in, garbage out" principle.

In response, Sonar has developed SonarSweep, a service that improves the quality of coding datasets used in LLM training by applying advanced code analysis to reduce quality and security issues. This process has proven effective, cutting security vulnerabilities by up to 67% and bugs by up to 42% without compromising functional correctness.

SonarSweep is particularly valuable for companies and developers seeking to improve model performance on a limited budget or within specialized environments, such as financial institutions or the defense sector, by enabling the development of customized, reliable AI coding models. The service is now available in early access, allowing leading companies to train LLMs that produce secure, high-quality code at reduced cost and risk.