Why LLM Inference Needs a New Kind of Router - Part 3
Blog post from Modular
Modular Cloud's routing layer makes each routing decision across pods in five stages: Prepare, Filter, Score, Pick, and Execute. Complex routing patterns are assembled from composable plugins rather than fixed algorithms, so customer requests such as consistent hashing or cache-aware routing with session stickiness can be met without writing a new algorithm from scratch. Plugins communicate only through typed slots in the RoutingContext, which keeps them decoupled from one another and allows configuration errors to be caught at build time. The split into Selector, Workflow, and Executor lets the same framework handle both single-dispatch routing and disaggregated routing, where one request's workflow spans multiple pods, as in prefill/decode. The system has been validated in production and aims to optimize large-scale inference holistically by bringing routing and scheduling decisions into one unified framework.
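To make the five-stage idea concrete, here is a minimal Python sketch of a plugin pipeline with typed slots. It is an illustration under stated assumptions, not Modular's implementation: only the stage names and the `RoutingContext` name come from the post, while `Slot`, `Plugin`, `Pod`, `build_pipeline`, `route`, and every plugin function are hypothetical. The scoring plugin uses rendezvous (highest-random-weight) hashing as one simple way to get session stickiness, and the Execute stage only records the decision instead of dispatching a request.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Tuple

STAGES = ("prepare", "filter", "score", "pick", "execute")

@dataclass(frozen=True)
class Slot:
    """Typed key that plugins use to exchange data through the context."""
    name: str
    py_type: type

@dataclass(frozen=True)
class Pod:  # hypothetical pod description
    name: str
    queue_depth: int = 0
    healthy: bool = True

class RoutingContext:
    def __init__(self, request, pods):
        self.request, self.pods = request, pods
        self._values = {}

    def put(self, slot, value):
        # Runtime type check on every write to a slot.
        if not isinstance(value, slot.py_type):
            raise TypeError(f"slot {slot.name!r} expects {slot.py_type.__name__}")
        self._values[slot] = value

    def get(self, slot):
        return self._values[slot]

@dataclass(frozen=True)
class Plugin:
    stage: str
    run: Callable[[RoutingContext], None]
    reads: Tuple[Slot, ...] = ()
    writes: Tuple[Slot, ...] = ()

def build_pipeline(plugins):
    """Order plugins by stage; reject any that read a slot nothing wrote earlier."""
    ordered = sorted(plugins, key=lambda p: STAGES.index(p.stage))
    available = set()
    for plugin in ordered:
        missing = [s.name for s in plugin.reads if s not in available]
        if missing:
            raise ValueError(f"{plugin.run.__name__} reads unwritten slots: {missing}")
        available.update(plugin.writes)
    return ordered

def route(pipeline, request, pods):
    ctx = RoutingContext(request, pods)
    for plugin in pipeline:
        plugin.run(ctx)
    return ctx

SESSION = Slot("session", str)
CANDIDATES = Slot("candidates", list)
SCORES = Slot("scores", dict)
CHOSEN = Slot("chosen", Pod)
DISPATCHED = Slot("dispatched", str)

def read_session(ctx):
    ctx.put(SESSION, ctx.request["session_id"])

def drop_unhealthy(ctx):
    ctx.put(CANDIDATES, [p for p in ctx.pods if p.healthy])

def affinity_minus_load(ctx):
    # Rendezvous hashing: each (session, pod) pair gets a stable pseudo-random
    # weight in [0, 1); a queue-depth penalty lets load override the affinity.
    scores = {}
    for pod in ctx.get(CANDIDATES):
        digest = hashlib.sha256(f"{ctx.get(SESSION)}|{pod.name}".encode()).digest()
        weight = int.from_bytes(digest[:8], "big") / 2**64
        scores[pod.name] = weight - 0.05 * pod.queue_depth
    ctx.put(SCORES, scores)

def pick_best(ctx):
    scores = ctx.get(SCORES)
    ctx.put(CHOSEN, max(ctx.get(CANDIDATES), key=lambda p: scores[p.name]))

def record_dispatch(ctx):
    # Stand-in for the real call to the chosen pod.
    ctx.put(DISPATCHED, f"{ctx.request['session_id']} -> {ctx.get(CHOSEN).name}")

PIPELINE = build_pipeline([
    Plugin("prepare", read_session, writes=(SESSION,)),
    Plugin("filter", drop_unhealthy, writes=(CANDIDATES,)),
    Plugin("score", affinity_minus_load, reads=(SESSION, CANDIDATES), writes=(SCORES,)),
    Plugin("pick", pick_best, reads=(CANDIDATES, SCORES), writes=(CHOSEN,)),
    Plugin("execute", record_dispatch, reads=(CHOSEN,), writes=(DISPATCHED,)),
])
```

Calling `route(PIPELINE, {"session_id": "s1"}, pods).get(CHOSEN)` returns the selected pod. Because no plugin calls another directly, swapping `affinity_minus_load` for a cache-aware scorer only requires that the replacement declare the same `reads` and `writes`; `build_pipeline` raises a `ValueError` before any request is served if a plugin would read a slot that no earlier plugin writes.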