
Achieving Up to 67% Cost Savings with Prefill-Decode Disaggregation Using Ray + vLLM on AMD MI325X

Blog post from Anyscale

Post Details
Company: Anyscale
Author: Kourosh Hakhamaneshi
Word Count: 2,090
Language: English
Summary

This blog post explores Prefill-Decode (PD) disaggregation with Ray Serve LLM on AMD hardware and reports up to 2.7x better "goodput" for LLM serving, which translates into cost savings of up to 67%. PD disaggregation places the prefill and decode phases on dedicated GPUs, which removes the interference between them and lets each phase run closer to its theoretical throughput. The benefits include consistent time per output token (TPOT) under load and savings that compound over long output sequences. The costs are operational: the KV cache must be transferred between prefill and decode workers, and the prefill-to-decode (P:D) ratio must be tuned per workload. The post identifies where PD helps, namely TPOT- or end-to-end (E2E) latency-sensitive workloads, and where aggregated serving is preferable, especially when time to first token (TTFT) is the critical constraint. It also covers the use of RIXL for KV transfer on AMD MI325X and stresses matching the P:D ratio to workload demand to avoid the pitfalls of PD disaggregation.
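The link between a goodput gain and a cost saving, and the idea of matching the P:D ratio to demand, can be illustrated with a little arithmetic. The sketch below is a toy model and not the post's methodology. It assumes both setups run on the same number of identically priced GPUs, so cost per served request scales as 1 / goodput, and it assumes each pool has a fixed per-GPU request rate. The function names and the example rates are hypothetical. The post's headline figures are "up to" bounds and may rest on a different cost model.

```python
from math import ceil


def cost_savings_from_goodput(goodput_ratio: float) -> float:
    """Fractional cost saving per served request for a given goodput gain.

    Assumption (not from the post): same GPU count and price in both
    setups, so cost per request is proportional to 1 / goodput.
    """
    if goodput_ratio <= 0:
        raise ValueError("goodput_ratio must be positive")
    return 1.0 - 1.0 / goodput_ratio


def decode_gpus_needed(
    prefill_gpus: int,
    prefill_rps_per_gpu: float,
    decode_rps_per_gpu: float,
) -> int:
    """Decode GPUs needed to keep up with the prefill pool's output rate.

    Assumption: each pool sustains a fixed requests-per-second rate per
    GPU for the workload in question (hypothetical, measured elsewhere).
    """
    if prefill_gpus <= 0 or prefill_rps_per_gpu <= 0 or decode_rps_per_gpu <= 0:
        raise ValueError("all inputs must be positive")
    prefill_rate = prefill_gpus * prefill_rps_per_gpu
    # Round up: an undersized decode pool becomes the bottleneck.
    return ceil(prefill_rate / decode_rps_per_gpu)


if __name__ == "__main__":
    # Doubling goodput on the same hardware halves cost per request.
    print(cost_savings_from_goodput(2.0))  # 0.5
    # Two prefill GPUs at 6 req/s each need three decode GPUs at 4 req/s each.
    print(decode_gpus_needed(2, 6.0, 4.0))  # 3
```

In this model a mismatched ratio leaves one pool idle while the other saturates, which is why the ratio has to be re-derived whenever the mix of prompt and output lengths changes.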