
Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries

Blog post from MongoDB

Post Details
Company: MongoDB
Date Published: -
Author: -
Word Count: 1,362
Language: English
Hacker News Points: -
Summary

Embedding model inference often hits efficiency limits when serving large volumes of short requests, a workload typical of search, retrieval, and recommendation systems. At Voyage AI by MongoDB, this problem is addressed with batching techniques that improve inference efficiency. The post explains why serving short requests one at a time is inefficient (such requests are memory-bound rather than compute-bound) and how padding removal in inference engines such as vLLM and SGLang makes batching variable-length requests effective. By adopting a token-count-based batching strategy, which sizes each batch by the actual compute required rather than by request count, the approach reduces per-request latency and cost while increasing throughput and model FLOPs utilization. The implementation uses Redis to enable efficient token-count-based batching, allowing better management of GPU resources and stable latency during traffic spikes. The result is a reported 50% reduction in GPU inference latency, along with higher throughput and better resource utilization, achieved while using significantly fewer GPUs.
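The core idea in the summary, packing requests by total token count instead of by a fixed request count, can be sketched as a greedy packing routine. This is a minimal illustration, not the post's actual implementation; the budget value, function name, and tokenized-request representation are all assumptions for the example.

```python
from typing import List

# Assumed per-batch token cap; in practice this would be tuned to the
# GPU's memory and compute profile.
TOKEN_BUDGET = 8192

def batch_by_token_count(
    requests: List[List[int]], budget: int = TOKEN_BUDGET
) -> List[List[List[int]]]:
    """Greedily pack tokenized requests into batches whose total token
    count stays under `budget`, so every batch carries roughly the same
    amount of compute. A single request longer than the budget still
    gets a batch of its own."""
    batches: List[List[List[int]]] = []
    current: List[List[int]] = []
    current_tokens = 0
    for req in requests:
        n = len(req)
        # Flush the current batch if adding this request would exceed
        # the token budget.
        if current and current_tokens + n > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

With a budget of 8,192 tokens, a 5,000-token request would occupy one batch alone, while many short queries would be packed together, which is why this strategy keeps per-batch compute (and hence latency) far more uniform than batching by request count.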