voyage-code-2: Elevate Your Code Retrieval
Blog post from Voyage AI
voyage-code-2 is a new embedding model designed for semantic retrieval of code and related text. The post reports a significant improvement in recall over embedding models from OpenAI and Cohere: a 14.52% increase on code retrieval tasks across 11 datasets derived from popular coding datasets such as HumanEval and MBPP, and a 3.03% average gain on general-purpose text datasets. The model maps queries and documents to high-dimensional embedding vectors and retrieves the code snippets whose embeddings are most semantically similar to the query, which makes it useful for code search, code completion, and general code assistance. The gains are attributed to training on extensive code datasets with techniques such as novel loss functions and contrastive pairs. Improvements in inference latency and throughput also make the model suitable for interactive production environments. Beyond code, voyage-code-2 outperforms competitors on non-coding tasks in diverse domains, and its development points to the potential for more specialized embedding models tailored to industries such as finance, healthcare, and law.
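The retrieval step described above (embed the query and the documents, then rank documents by similarity to the query) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model's actual pipeline: `toy_embed` is a hypothetical stand-in that builds a unit-length bag-of-words vector, so it only matches shared words, whereas a learned model such as voyage-code-2 would return dense vectors that capture meaning. The function names and the sample snippets are invented for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model.

    Returns a sparse, unit-length bag-of-words vector as a dict.
    A real embedding model would return a dense vector instead.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(u, v):
    # Both vectors are unit length, so their dot product is the cosine similarity.
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def search(query, documents, k=1):
    """Return the k (score, document) pairs most similar to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented sample corpus of code snippets with short comments.
snippets = [
    "def reverse_string(s): return s[::-1]  # reverse a string",
    "def add(a, b): return a + b  # add two numbers",
    "def read_file(path): return open(path).read()  # read a file from disk",
]

for score, doc in search("reverse a string", snippets, k=2):
    print(f"{score:.3f}  {doc}")
```

With this corpus, the query "reverse a string" ranks the `reverse_string` snippet first. Replacing `toy_embed` with calls to a real embedding model would leave `cosine` and `search` unchanged, provided the returned vectors are normalized to unit length or the cosine is computed with explicit norms.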