Scalar Quantization: Background, Practices & More
Blog post from Qdrant
Scalar quantization in Qdrant is a data compression technique that significantly reduces the memory footprint of high-dimensional vector embeddings by converting float32 values (4 bytes each) to int8 (1 byte each), a 75% memory reduction per value. This is particularly beneficial for large datasets, where it cuts memory usage without significantly sacrificing precision. The conversion is a partially reversible transformation: the int8 values can be mapped back to approximate floats, striking a balance between compression and accuracy.

Benchmarks reveal that while search precision decreases slightly, the latency improvement is substantial, with mean search time reduced by up to 60.64%.

Qdrant's architecture supports combining quantized and original vectors in a single query: the compact quantized vectors can be kept in RAM for fast candidate retrieval, while the original vectors are used to rescore the results and preserve accuracy. This approach lets low-end machines handle a high volume of requests effectively, achieving up to a fourfold decrease in memory usage and doubling performance under certain conditions.
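The float32-to-int8 conversion and its approximate inverse can be sketched as a min-max affine mapping onto the int8 range. This is a minimal NumPy illustration of the idea, not Qdrant's internal implementation (whose choice of bounds may differ, e.g. using quantiles to clip outliers):

```python
import numpy as np

def quantize(vec: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map float32 components affinely onto the int8 range [-128, 127]."""
    lo, hi = float(vec.min()), float(vec.max())
    scale = (hi - lo) / 255.0          # width of one quantization step
    q = np.round((vec - lo) / scale - 128.0).astype(np.int8)
    return q, lo, scale

def dequantize(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximately invert the mapping (the 'partially reversible' step)."""
    return (q.astype(np.float32) + 128.0) * scale + lo

rng = np.random.default_rng(0)
vec = rng.standard_normal(768).astype(np.float32)

q, lo, scale = quantize(vec)
restored = dequantize(q, lo, scale)

print(vec.nbytes, q.nbytes)   # 3072 vs 768 bytes: a 75% reduction
```

Because each component is rounded to the nearest of 256 levels, the round-trip error is bounded by half a quantization step, which is what makes the compression lossy but predictable.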
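The combined use of quantized and original vectors can be sketched as a two-stage search: score all points cheaply in the int8 domain, keep an oversampled candidate set, then rescore only those candidates with the original float32 vectors. This toy NumPy version illustrates the pattern only; it is not Qdrant's API, where the equivalent behavior is configured through quantization search parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
originals = rng.standard_normal((10_000, 64)).astype(np.float32)

# Quantize the whole collection with shared min/max bounds (a simplification).
lo, hi = float(originals.min()), float(originals.max())
scale = (hi - lo) / 255.0
quantized = np.round((originals - lo) / scale - 128.0).astype(np.int8)

query = np.clip(rng.standard_normal(64).astype(np.float32), lo, hi)
q_query = np.round((query - lo) / scale - 128.0).astype(np.int8)

# Stage 1: cheap integer dot products against the compact vectors,
# keeping 4x more candidates than we finally need (oversampling).
coarse = quantized.astype(np.int32) @ q_query.astype(np.int32)
candidates = np.argsort(coarse)[-40:]

# Stage 2: rescore only the candidates with the original float32 vectors.
exact = originals[candidates] @ query
top10 = candidates[np.argsort(exact)[-10:]][::-1]
print(top10)
```

The design point is that stage 1 touches every vector but only in the small int8 representation, while stage 2 touches the large float32 vectors for just a handful of candidates, which is how memory savings and accuracy can coexist.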