Author: Seiji Eicher
Word count: 1281
Language: English

Summary

Ray Serve LLM introduces new APIs for deploying advanced serving patterns for sparse mixture-of-experts models, such as DeepSeek and Qwen3, with vLLM on the Anyscale platform. The APIs support wide expert parallelism and disaggregated prefill/decode serving: models achieve high throughput and low latency by balancing load across experts and by separating the processes that handle input prompts (prefill) from those that generate output tokens (decode). With Ray Serve, developers compose complex model deployments through Pythonic builder patterns that provide dynamic scaling, stateful routing, and fault-tolerant orchestration while remaining compatible with Kubernetes environments. This approach reduces the operational burden of coordinating multi-node setups and improves performance through programmable orchestration, making efficient use of resources while upholding tight service-level agreements.
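A minimal sketch of the Pythonic builder pattern the summary describes, using Ray Serve LLM's `LLMConfig` and `build_openai_app`. The model name, checkpoint, replica counts, and parallelism settings below are illustrative assumptions, not values from the article, and running it requires a GPU cluster with `ray[serve,llm]` and vLLM installed:

```python
# Hypothetical deployment sketch with Ray Serve LLM's builder API.
# All model names and sizing values here are illustrative assumptions.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen3",                  # name clients request; illustrative
        model_source="Qwen/Qwen3-235B-A22B",  # illustrative MoE checkpoint
    ),
    deployment_config=dict(
        # Dynamic scaling: Ray Serve adds/removes replicas with load.
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
    ),
    # vLLM engine arguments; parallel sizes are illustrative.
    engine_kwargs=dict(
        tensor_parallel_size=8,
        enable_expert_parallel=True,  # shard MoE experts across GPUs
    ),
)

# Build an OpenAI-compatible app and deploy it; Ray Serve handles
# placement, routing, and fault-tolerant orchestration on the cluster.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```

Because the deployment is plain Python, patterns like prefill/decode disaggregation can be expressed by composing additional deployments rather than by hand-editing cluster manifests.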