
Why LLM Inference Needs a New Kind of Router - Part 3

Blog post from Modular

Post Details
Company: Modular
Date Published:
Author: Aayush Deshpande
Word Count: 1,926
Language: English
Summary

Modular Cloud's routing layer makes routing decisions across pods in five stages: Prepare, Filter, Score, Pick, and Execute. Because routing patterns are assembled from composable plugins rather than fixed algorithms, customer requests such as consistent hashing or cache-aware routing with session stickiness can be met by combining plugins instead of writing a new algorithm from scratch. Plugins communicate through typed slots in the RoutingContext, which keeps them decoupled from one another and allows errors to be caught at build time. The Selector, Workflow, and Executor split lets the same framework handle both single-dispatch routing and disaggregated routing, where one request involves multiple pods, as in prefill/decode. The system is validated in production and aims to provide holistic optimizations for large-scale inference by bringing routing and scheduling decisions into a single framework.
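
To make the staged, plugin-based design more concrete, here is a minimal Python sketch of a single-dispatch pipeline. It is not Modular's implementation: only the stage names and the `RoutingContext` name come from the summary, while `Slot`, `Plugin`, `Pipeline`, `QUEUE_DEPTH`, and the individual plugin functions are hypothetical names invented for illustration. The session-affinity scorer uses plain modulo hashing as a simple stand-in for consistent hashing, and the Selector, Workflow, and Executor split is not modeled.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Slot(Generic[T]):
    """Typed key that lets plugins share data without importing each other."""
    name: str


class RoutingContext:
    """Per-request state passed through every stage (hypothetical shape)."""

    def __init__(self, session_id: str, pods: list[str]) -> None:
        self.session_id = session_id
        self.candidates = list(pods)
        self.scores: dict[str, float] = {}
        self._slots: dict[Slot, object] = {}

    def put(self, slot: Slot[T], value: T) -> None:
        self._slots[slot] = value

    def get(self, slot: Slot[T]) -> T:
        return self._slots[slot]  # type: ignore[return-value]


@dataclass(frozen=True)
class Plugin:
    """A stage step plus the slots it declares it reads and writes."""
    run: Callable[[RoutingContext], None]
    reads: tuple[Slot, ...] = ()
    writes: tuple[Slot, ...] = ()


# Hypothetical slot: pod name -> number of queued requests.
QUEUE_DEPTH: Slot[dict[str, int]] = Slot("queue_depth")


def prepare_queue_depth(depths: dict[str, int]) -> Plugin:
    return Plugin(lambda ctx: ctx.put(QUEUE_DEPTH, depths), writes=(QUEUE_DEPTH,))


def filter_overloaded(max_depth: int) -> Plugin:
    def run(ctx: RoutingContext) -> None:
        depths = ctx.get(QUEUE_DEPTH)
        ctx.candidates = [p for p in ctx.candidates if depths[p] <= max_depth]
    return Plugin(run, reads=(QUEUE_DEPTH,))


def score_least_loaded() -> Plugin:
    def run(ctx: RoutingContext) -> None:
        depths = ctx.get(QUEUE_DEPTH)
        for pod in ctx.candidates:
            ctx.scores[pod] = ctx.scores.get(pod, 0.0) - depths[pod]
    return Plugin(run, reads=(QUEUE_DEPTH,))


def score_session_affinity(weight: float) -> Plugin:
    def run(ctx: RoutingContext) -> None:
        if not ctx.candidates:
            return
        # Modulo hashing over the sorted candidates; NOT true consistent hashing.
        digest = hashlib.sha256(ctx.session_id.encode()).digest()
        ordered = sorted(ctx.candidates)
        sticky = ordered[int.from_bytes(digest[:8], "big") % len(ordered)]
        ctx.scores[sticky] = ctx.scores.get(sticky, 0.0) + weight
    return Plugin(run)


@dataclass
class Pipeline:
    prepare: list[Plugin]
    filters: list[Plugin]
    scorers: list[Plugin]

    def __post_init__(self) -> None:
        # "Build time" check: every slot must be written before it is read.
        available: set[Slot] = set()
        for plugin in [*self.prepare, *self.filters, *self.scorers]:
            missing = [s.name for s in plugin.reads if s not in available]
            if missing:
                raise ValueError(f"slots read before being written: {missing}")
            available.update(plugin.writes)

    def route(self, session_id: str, pods: list[str]) -> dict[str, str]:
        ctx = RoutingContext(session_id, pods)
        for plugin in [*self.prepare, *self.filters, *self.scorers]:
            plugin.run(ctx)
        if not ctx.candidates:
            raise RuntimeError("no eligible pods")
        # Pick: highest score wins; sorting makes ties deterministic.
        pod = max(sorted(ctx.candidates), key=lambda p: ctx.scores.get(p, 0.0))
        # Execute: a real router would dispatch here; the sketch returns the decision.
        return {"session": session_id, "pod": pod}
```

In this sketch, swapping routing behavior means changing the plugin lists rather than the pipeline: adding `score_session_affinity` with a large weight makes a session stick to one eligible pod, while omitting it falls back to least-loaded selection. Constructing a `Pipeline` whose filter reads `QUEUE_DEPTH` without any earlier plugin writing it raises `ValueError` before any request is routed, which is one possible reading of the build-time checking the summary mentions.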