Company: Baseten
Date Published:
Author: Amir Haghighat and 4 others
Word count: 2111
Language: English
Hacker News points: None

Summary

Baseten Embedding Inference (BEI) is an optimized runtime built on TensorRT-LLM to raise throughput and cut latency for embedding, reranking, and classification models. It addresses the challenges posed by the growing size of modern embedding models, which have evolved from BERT-based architectures into much larger LLM-based ones. BEI delivers up to 2.05x the throughput of existing solutions while keeping latency low and sustaining high concurrency for real-time queries. It supports a wide range of architectures, including these newer LLM-based models, and applies techniques such as batching, sequence packing, and FP8 quantization to extract further performance. On the infrastructure side, BEI adds traffic-based autoscaling and asynchronous inference to handle high-throughput workloads efficiently. Together, these developments position BEI as a leading option for deploying low-latency, high-throughput embedding models, giving developers both flexibility and improved system performance.
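
To give a rough sense of why sequence packing (one of the batching techniques the summary mentions) helps, here is a minimal Python sketch. It is illustrative only: the token lengths, the 512-token row size, and the greedy packing strategy are assumptions for the example, not BEI internals. It compares the compute spent when short sequences are padded to the longest sequence in a batch versus packed back-to-back into fixed-size rows:

```python
# Illustrative sketch: padding vs. sequence packing.
# Assumes every sequence is at most max_len tokens long.

def padded_tokens(lengths: list[int]) -> int:
    """Tokens processed when every sequence is padded to the longest one."""
    return max(lengths) * len(lengths)

def packed_tokens(lengths: list[int], max_len: int = 512) -> int:
    """Tokens processed when sequences are packed back-to-back into
    fixed-size rows, so little compute is wasted on padding."""
    rows, used = 1, 0
    for n in lengths:
        if used + n > max_len:  # start a new row when the current one is full
            rows, used = rows + 1, 0
        used += n
    return rows * max_len

lengths = [37, 512, 64, 120, 18, 400, 256, 90]  # hypothetical request mix
print(f"padded: {padded_tokens(lengths)} tokens")  # 512 * 8 = 4096
print(f"packed: {packed_tokens(lengths)} tokens")  # 5 rows * 512 = 2560
```

With this mix of request lengths, packing processes roughly 40% fewer tokens than naive padding; the actual gains in a runtime like BEI depend on the real distribution of sequence lengths and the scheduler's packing policy.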