Building the foundation for running extra-large language models
Blog post from Cloudflare
Cloudflare now hosts extra-large language models such as Moonshot's Kimi K2.5 on Workers AI, and has tuned its serving stack to make agentic use cases faster and more efficient. The architecture uses prefill-decode disaggregation: the prefill stage (processing the input prompt) and the decode stage (generating output tokens) run on different servers, which improves GPU utilization and overall performance. Token-aware load balancing and prompt caching increase throughput and reduce latency, particularly in high-context scenarios such as AI code reviews. Cloudflare uses Mooncake's Transfer Engine and Store to manage the KV cache across multiple GPUs, and extends the cache beyond VRAM onto NVMe storage to handle increased traffic. Speculative decoding with NVIDIA's EAGLE-3 model speeds up token generation while maintaining quality. Infire, Cloudflare's proprietary inference engine written in Rust, supports multi-GPU configurations, reduces memory overhead, and achieves faster cold starts, raising tokens-per-second throughput by up to 20%. Together, these changes reflect Cloudflare's ongoing work to optimize its infrastructure for high-quality machine learning inference while adapting to new technologies and models.
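
To make the interaction between token-aware load balancing and prompt caching concrete, here is a minimal sketch in Python. It is not Cloudflare's implementation, and the summary above does not describe the actual routing algorithm. The names `Replica`, `route`, and `pending_tokens` are hypothetical, and the cost model (pending tokens plus the tokens of the new prompt that are not already cached) is an assumption chosen for illustration. Production systems typically match cached prefixes at block granularity rather than token by token.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def common_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Replica:
    """Hypothetical inference server as seen by the balancer."""
    name: str
    pending_tokens: int = 0  # tokens queued or in flight on this server
    cached: list[tuple[int, ...]] = field(default_factory=list)

    def cached_prefix_len(self, tokens: tuple[int, ...]) -> int:
        """Longest prefix of `tokens` already present in this replica's cache."""
        return max((common_prefix_len(tokens, c) for c in self.cached), default=0)


def route(replicas: list[Replica], tokens: tuple[int, ...]) -> tuple[Replica, int]:
    """Send the request to the replica with the lowest estimated token cost.

    Cost = tokens already pending on the replica + prompt tokens that would
    still need prefill there (prompt length minus the cached prefix).
    Returns the chosen replica and the uncached token count charged to it.
    """
    def uncached(r: Replica) -> int:
        return len(tokens) - r.cached_prefix_len(tokens)

    chosen = min(replicas, key=lambda r: r.pending_tokens + uncached(r))
    charged = uncached(chosen)
    chosen.pending_tokens += charged
    chosen.cached.append(tokens)  # the prompt's KV cache now lives on this replica
    return chosen, charged


# Usage: two requests sharing a 1000-token context land on the same replica.
replicas = [Replica("a"), Replica("b")]
context = tuple(range(1000))

first, cost1 = route(replicas, context + (1, 2, 3))
first.pending_tokens -= cost1  # the first request has finished

second, cost2 = route(replicas, context + (4, 5, 6))
print(first.name, cost1, second.name, cost2)
```

In this sketch the second request is charged only for the three tokens that differ from the cached context, which is why long shared prompts (for example, a repository under review) benefit most. Because the cost also includes `pending_tokens`, a heavily loaded replica can still lose the request to an idle one even when it holds the cached prefix.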