Company: Baseten
Date Published:
Author: Amir Haghighat and 4 others
Word count: 2111
Language: English
Hacker News points: None

Summary

Baseten Embedding Inference (BEI) is an optimized runtime built on TensorRT-LLM to raise throughput and cut latency for embedding, reranking, and classification models. It addresses the challenges posed by the growing size of modern embedding models, which have evolved from BERT-based architectures into much larger LLM-based ones. BEI delivers up to 2.05x the throughput of existing solutions while keeping latency low and sustaining high concurrency for real-time queries. It supports a wide range of architectures, including these newer LLM-based models, and applies techniques such as batching, sequence packing, and FP8 quantization to extract further performance. On the infrastructure side, BEI adds traffic-based autoscaling and asynchronous inference to handle high-throughput workloads efficiently. Together, these developments position BEI as a leading option for deploying low-latency, high-throughput embedding models, giving developers both flexibility and improved system performance.
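
To give a rough sense of why sequence packing (one of the batching techniques the summary mentions) helps, here is a minimal Python sketch. It is illustrative only: the token lengths, the 512-token row size, and the greedy packing strategy are assumptions for the example, not BEI internals. It compares the compute spent when short sequences are padded to the longest sequence in a batch versus packed back-to-back into fixed-size rows:

```python
# Illustrative sketch: padding vs. sequence packing.
# Assumes every sequence is at most max_len tokens long.

def padded_tokens(lengths: list[int]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return max(lengths) * len(lengths)

def packed_tokens(lengths: list[int], max_len: int = 512) -> int:
    """Tokens processed when sequences are packed back-to-back into
    fixed-size rows, so little compute is wasted on padding."""
    rows, used = 1, 0
    for n in lengths:
        if used + n > max_len:  # start a new row when the current one is full
            rows, used = rows + 1, 0
        used += n
    return rows * max_len

lengths = [37, 512, 64, 120, 18, 400, 256, 90]  # hypothetical request mix
print(f"padded: {padded_tokens(lengths)} tokens")  # 512 * 8 = 4096
print(f"packed: {packed_tokens(lengths)} tokens")  # 5 rows * 512 = 2560
```

With this mix of request lengths, packing processes roughly 40% fewer tokens than naive padding; the actual gains in a runtime like BEI depend on the real distribution of sequence lengths and the scheduler's packing policy.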