State-of-the-Art Code Retrieval With Efficient Code Embedding Models
Blog post from Qodo
Qodo-Embed-1 is a new family of code embedding models that achieves state-of-the-art performance with a smaller footprint than existing models, leading the CoIR benchmark for code-oriented information retrieval. The 1.5B model scores 68.53, surpassing several larger models, while the 7B variant reaches 71.5.

The core problem with traditional embedding models is that they struggle to retrieve relevant code snippets from natural language queries: they tend to latch onto natural-language patterns rather than code-specific elements. Qodo-Embed-1 was trained on synthetically generated data, including natural language descriptions and docstrings paired with code, so the model learns to align queries with the snippets they describe. This improves retrieval accuracy while reducing computational overhead and cost.

The smaller model size also makes the models more accessible and easier to deploy, offering an efficient, cost-effective option for developers. The family is available on Hugging Face: the 1.5B model is open-sourced under the OpenRAIL++-M license, and the 7B model is available commercially.
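To make the retrieval mechanism concrete: an embedding model maps both the natural language query and each code snippet into vectors, and the snippet whose vector is closest to the query's (by cosine similarity) is returned. The sketch below illustrates only that flow with a toy bag-of-words "embedding"; in practice the dense vectors would come from a model like Qodo-Embed-1 loaded from Hugging Face, and all function names and snippets here are illustrative, not Qodo's API.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding": a stand-in for a dense code
    # embedding model such as Qodo-Embed-1, used only to show the
    # query -> vector -> nearest-snippet retrieval flow.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, snippets):
    # Embed the query once, then rank snippets by similarity.
    query_vec = embed(query)
    return max(snippets, key=lambda s: cosine(query_vec, embed(s)))


snippets = [
    "def read_json(path): import json; return json.load(open(path))",
    "def fibonacci(n): return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
]

print(retrieve("load json from a file path", snippets))
```

A real deployment would precompute and index the snippet embeddings (e.g. in a vector database) so that only the query needs to be embedded at query time, which is where a smaller model like the 1.5B variant pays off in latency and cost.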