Evaluating Open-Source vs. OpenAI Embeddings for RAG: A How-To Guide

Company

Timescale

Date Published

Dec. 18, 2024

Author

Jacky Liang

Word count

1389

Language

English

Hacker News points

None

URL

www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag

Summary

This guide simplifies the process of evaluating embedding models for search and retrieval-augmented generation (RAG) applications. The author used pgai Vectorizer, an open-source tool, to test four popular embedding models (OpenAI's small and large models, BGE large, and nomic-embed-text) on a dataset of Paul Graham's essays. The evaluation measured how well each model retrieves relevant content when given different types of questions. OpenAI's large model achieved the highest overall accuracy, while the open-source models remained competitive. The author highlights cost constraints, size vs. performance trade-offs, and input data quality as the main considerations when choosing an embedding model. The guide closes with a checklist for testing other models and tips on using pgai Vectorizer to simplify the testing process.
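To make the evaluation idea concrete, here is a minimal sketch of one way to score a model's retrieval accuracy: the fraction of questions whose source chunk appears among the top-k most similar chunks. This summary does not state the exact metric or code the guide uses, so the metric choice, the function names (`cosine_similarity`, `top_k_hit_rate`), and the toy two-dimensional vectors are all assumptions for illustration. In practice the vectors would come from each embedding model under test.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_hit_rate(question_vecs, source_chunk_ids, chunk_vecs, k=10):
    """Fraction of questions whose source chunk is among the k nearest chunks.

    question_vecs:    list of question embeddings
    source_chunk_ids: for each question, the id of the chunk it was written from
    chunk_vecs:       dict mapping chunk id -> chunk embedding
    """
    if not question_vecs:
        return 0.0
    hits = 0
    for q_vec, source_id in zip(question_vecs, source_chunk_ids):
        # Rank all chunk ids by similarity to the question, best first.
        ranked = sorted(
            chunk_vecs,
            key=lambda cid: cosine_similarity(q_vec, chunk_vecs[cid]),
            reverse=True,
        )
        if source_id in ranked[:k]:
            hits += 1
    return hits / len(question_vecs)


# Toy data (hypothetical): three chunks and three questions in 2-D.
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
question_vecs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
source_chunk_ids = ["c1", "c2", "c2"]  # the third question is a deliberate miss

hit_rate = top_k_hit_rate(question_vecs, source_chunk_ids, chunk_vecs, k=1)
print(f"top-1 hit rate: {hit_rate:.2f}")
```

Running the same function once per model, on the same questions and chunks, gives one comparable number per model. Splitting the questions by type before scoring would show where each model is stronger or weaker.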