The use of embeddings in AI applications has become increasingly prevalent, with developers relying on them for tasks such as search, retrieval augmented generation (RAG), and clustering. Supabase supports storing embeddings in Postgres using the pgvector extension, which can compare vectors by inner product, cosine distance, or Euclidean distance.

Challenges arise with large datasets: without an index, every similarity query must scan the entire table, which quickly becomes a performance problem. To address this, pgvector provides the IVFFlat index, which clusters vectors into lists so that queries only search the nearest lists, enabling fast approximate similarity search. Even with an index, scaling remains challenging because vector data is large.

The Massive Text Embedding Benchmark (MTEB) compares text embedding models across a range of tasks, and its results show that models with fewer dimensions can still score well on similarity benchmarks. Smaller dimensions also mean faster queries and lower RAM usage, so by choosing a model that balances similarity performance, sequence length, and dimension size, developers can get better results from their embeddings with fewer resources.
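As a rough sketch of how these pieces fit together, the snippet below (Python with psycopg2; the connection string, `documents` table, and 384-dimension column are illustrative assumptions, not taken from the post) creates a vector column, builds an IVFFlat index using cosine distance, and runs an approximate similarity query:

```python
# Sketch: storing embeddings with pgvector and querying them through an
# IVFFlat index. Table name, connection string, and the 384-dimension
# column size are placeholders; match them to your schema and model.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
cur = conn.cursor()

# Enable pgvector and create a table with a fixed-dimension vector column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(384)  -- smaller dimension => less RAM, faster queries
    );
""")

# IVFFlat clusters vectors into lists; queries scan only the nearest lists,
# trading a little recall for much faster approximate search.
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
""")
conn.commit()

def match_documents(query_embedding, limit=5):
    """Return the documents closest to query_embedding by cosine distance."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        """
        SELECT id, content, embedding <=> %s::vector AS distance
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
        """,
        (vec, vec, limit),
    )
    return cur.fetchall()
```

Note that pgvector recommends creating an IVFFlat index after the table has data, so the list centroids reflect the actual distribution, and the `lists` parameter (along with `ivfflat.probes` at query time) controls the speed/recall trade-off.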