Ray Serve, a scalable model serving library built on Ray, has introduced a custom request routing feature, the PrefixCacheAffinityRouter, which significantly reduces latency in large language model (LLM) inference, particularly for models like DeepSeek-R1 and Kimi K2. The router takes advantage of the prefix cache, which stores the key-value vectors already computed during the attention phase of previous requests, and sends requests that share a common prefix to the same replica, maximizing cache hits and avoiding wasted GPU cycles on recomputation. This approach achieves a 60% reduction in time-to-first-token (TTFT) and over 40% improvement in end-to-end throughput, with the largest gains for large Mixture of Experts models that rely on data parallel attention and expert parallel sharding. Benchmarked with the PrefixRepetitionDataset, the new routing strategy sustained higher throughput and constant cache hit rates even as the number of replicas scaled, compared to the traditional "Power of Two Choices" routing strategy.
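
The core idea can be sketched as follows. This is a minimal, illustrative Python sketch of prefix-affinity routing with a load-based fallback, not Ray Serve's actual PrefixCacheAffinityRouter implementation; the class name, the `match_rate_threshold` parameter, and the per-replica prefix tracking shown here are assumptions made for the example.

```python
# Illustrative sketch of prefix-affinity routing (not Ray Serve's internal code).
import random
from collections import defaultdict


class PrefixAffinityRouter:
    def __init__(self, replicas, match_rate_threshold=0.1):
        self.replicas = replicas                    # replica ids (hypothetical)
        self.replica_prefixes = defaultdict(set)    # replica -> prompts it has already served
        self.match_rate_threshold = match_rate_threshold

    def _longest_match(self, prompt, prefixes):
        # Length of the longest previously served prompt that prefixes this one.
        return max((len(p) for p in prefixes if prompt.startswith(p)), default=0)

    def _power_of_two_choices(self, queue_lens):
        # Fallback: sample two replicas and pick the one with the shorter queue.
        a, b = random.sample(self.replicas, 2)
        return a if queue_lens[a] <= queue_lens[b] else b

    def route(self, prompt, queue_lens):
        # Find the replica whose cached prefixes best match this prompt.
        best_replica, best_len = None, 0
        for r in self.replicas:
            match_len = self._longest_match(prompt, self.replica_prefixes[r])
            if match_len > best_len:
                best_replica, best_len = r, match_len

        # Route for cache affinity only if a meaningful fraction of the prompt
        # matches; otherwise fall back to load-aware Power of Two Choices.
        match_rate = best_len / max(len(prompt), 1)
        if best_replica is not None and match_rate >= self.match_rate_threshold:
            chosen = best_replica
        else:
            chosen = self._power_of_two_choices(queue_lens)

        self.replica_prefixes[chosen].add(prompt)
        return chosen


if __name__ == "__main__":
    # Hypothetical replica ids and queue lengths, for demonstration only.
    router = PrefixAffinityRouter(replicas=["r1", "r2", "r3"])
    print(router.route("You are a helpful assistant. Summarize:", {"r1": 2, "r2": 0, "r3": 1}))
    print(router.route("You are a helpful assistant. Summarize: this text", {"r1": 2, "r2": 0, "r3": 1}))
```

The threshold captures the trade-off the paragraph describes: when a prompt shares enough of its prefix with work a replica has already done, cache affinity wins over pure load balancing; when it does not, the router behaves like the traditional "Power of Two Choices" strategy so that hot replicas are not overloaded for negligible cache benefit.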