
Token-count-based Batching: Faster, Cheaper Embedding Inference for Queries

Blog post from MongoDB

Post Details
Company: MongoDB
Date Published: -
Author: -
Word Count: 1,362
Language: English
Hacker News Points: -
Summary

Embedding model inference often hits efficiency limits when serving large volumes of short requests, a workload typical of search, retrieval, and recommendation systems. At Voyage AI by MongoDB, this problem is addressed with batching techniques that improve inference efficiency. The post explains why serving short requests one at a time is inefficient (such requests are memory-bound rather than compute-bound) and how padding removal in inference engines such as vLLM and SGLang makes batching variable-length requests effective. By adopting a token-count-based batching strategy, which sizes each batch by the actual compute required rather than by request count, the approach reduces per-request latency and cost while increasing throughput and model FLOPs utilization. The implementation uses Redis to enable efficient token-count-based batching, allowing better management of GPU resources and stable latency during traffic spikes. The result is a reported 50% reduction in GPU inference latency, along with higher throughput and better resource utilization, achieved while using significantly fewer GPUs.
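The core idea in the summary, packing requests by total token count instead of by a fixed request count, can be sketched as a greedy packing routine. This is a minimal illustration, not the post's actual implementation; the budget value, function name, and tokenized-request representation are all assumptions for the example.

```python
from typing import List

# Assumed per-batch token cap; in practice this would be tuned to the
# GPU's memory and compute profile.
TOKEN_BUDGET = 8192

def batch_by_token_count(
    requests: List[List[int]], budget: int = TOKEN_BUDGET
) -> List[List[List[int]]]:
    """Greedily pack tokenized requests into batches whose total token
    count stays under `budget`, so every batch carries roughly the same
    amount of compute. A single request longer than the budget still
    gets a batch of its own."""
    batches: List[List[List[int]]] = []
    current: List[List[int]] = []
    current_tokens = 0
    for req in requests:
        n = len(req)
        # Flush the current batch if adding this request would exceed
        # the token budget.
        if current and current_tokens + n > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```

With a budget of 8,192 tokens, a 5,000-token request would occupy one batch alone, while many short queries would be packed together, which is why this strategy keeps per-batch compute (and hence latency) far more uniform than batching by request count.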