voyage-code-3: more accurate code retrieval with lower dimensional, quantized embeddings

Post Details

Company

Voyage AI

Date Published

Dec. 4, 2024

Author

Voyage AI

Word Count

1,175

Language

English

Hacker News Points

-

Source URL

blog.voyageai.com/2024/12/04/voyage-code-3

Summary

The voyage-code-3 model is a next-generation embedding model designed for code retrieval. It outperforms OpenAI-v3-large and CodeSage-large, which are competing models rather than predecessors, by an average of 13.80% and 16.81%, respectively, across 32 code retrieval datasets. It reduces storage and search costs by supporting smaller embedding dimensions and quantized formats such as int8 and binary, which are enabled by Matryoshka learning and quantization-aware training. Retrieval quality stays high even at lower precision, with embeddings available in 256, 512, 1024, or 2048 dimensions and a context length of 32K tokens. To address the particular challenges of code retrieval, voyage-code-3 is trained on a diverse, high-quality code corpus and evaluated on datasets tailored to real-world applications, where it performs strongly on retrieval tasks such as text-to-code and code-to-code. Users can further improve retrieval quality with binary rescoring, and the model comes with an initial free allocation of tokens for exploration and experimentation.
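
The ideas named in the summary (truncating a Matryoshka embedding, binary quantization, and rescoring) can be illustrated with a minimal, standard-library Python sketch. Everything here is an assumption made for illustration: the toy vectors stand in for real model output, the helper names (`truncate`, `binarize`, `hamming`, `search`) are hypothetical, and the second stage rescores with float document vectors, which is one common form of rescoring and not necessarily the exact procedure Voyage AI uses. No Voyage API call is shown.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` components (Matryoshka-style) and re-normalize."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def binarize(vec):
    """Pack sign bits into one int: bit i is 1 when vec[i] > 0."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, docs, dim, top_k=1, candidates=3):
    """Two-stage search: Hamming shortlist, then float rescoring (assumed)."""
    q = truncate(query, dim)
    q_code = binarize(q)
    doc_vecs = [truncate(d, dim) for d in docs]
    doc_codes = [binarize(d) for d in doc_vecs]
    # Stage 1: cheap shortlist using only the 1-bit-per-dimension codes.
    shortlist = sorted(range(len(docs)),
                       key=lambda i: hamming(q_code, doc_codes[i]))[:candidates]
    # Stage 2: rescore the shortlist with the float vectors (cosine similarity,
    # since truncate() returns unit-length vectors).
    reranked = sorted(shortlist, key=lambda i: dot(q, doc_vecs[i]), reverse=True)
    return reranked[:top_k]

# Toy 8-dimensional "embeddings"; real ones would come from the model.
DOCS = [
    [0.9, 0.1, -0.3, 0.2, 0.05, -0.01, 0.02, 0.0],
    [-0.5, 0.8, 0.1, -0.2, 0.03, 0.04, -0.02, 0.01],
    [0.7, 0.2, -0.4, -0.1, -0.02, 0.01, 0.03, 0.0],
]
QUERY = [0.8, 0.15, -0.35, 0.1, 0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    print(search(QUERY, DOCS, dim=4, top_k=2, candidates=2))
```

In this sketch, each binary code takes one bit per dimension instead of a 32-bit float, which is where the storage savings come from. The shortlist size (`candidates`) trades first-stage speed against how many documents get the more precise second-stage score.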