
Announcing SonarSweep: Improving training data quality for coding LLMs

Blog post from Sonar

Post Details
Company:
Date Published:
Author: Tariq Shaukat
Word Count: 731
Language: English
Hacker News Points: -
Summary

AI-assisted coding offers significant potential, but the effectiveness of code generated by Large Language Models (LLMs) depends heavily on the quality of the training data. Research from Anthropic and Sonar has shown that poor-quality or malicious data can introduce severe security vulnerabilities and bugs into generated code, illustrating the "garbage in, garbage out" principle. In response, Sonar has developed SonarSweep, a service that improves the quality of coding datasets used in LLM training by applying advanced code analysis to reduce quality and security issues. According to Sonar, this process reduced security vulnerabilities by up to 67% and bugs by up to 42% without compromising functional correctness. SonarSweep is particularly valuable for companies and developers seeking to improve model performance on limited budgets or within constrained environments, such as financial institutions or defense organizations, by enabling the development of customized, reliable AI coding models. The service is now available in early access, letting leading companies train LLMs that produce secure, high-quality code at reduced cost and risk.