Building the foundation for running extra-large language models

Blog post from Cloudflare

Post Details
Company: Cloudflare
Author: Michelle Chen, Kevin Flansburg, and Vlad Krasnov
Word Count: 1,997
Language: English
Summary

Cloudflare's recent work on hosting extra-large language models such as Moonshot's Kimi K2.5 through Workers AI has made agentic use cases faster and more efficient to serve. Prefill-decode disaggregation runs the prefill and decode stages on separate servers, which lets each stage make better use of its GPUs. Token-aware load balancing and prompt caching raise throughput and lower latency, which particularly benefits high-context scenarios such as AI code reviews. Cloudflare uses Mooncake's Transfer Engine and Store to manage the KV cache across multiple GPUs, and extends the cache beyond VRAM onto NVMe storage to handle increased traffic. Speculative decoding with NVIDIA's EAGLE-3 model speeds up token generation while maintaining quality. Infire, Cloudflare's proprietary inference engine written in Rust, supports multi-GPU configurations, reduces memory overhead, achieves faster cold starts, and raises tokens-per-second throughput by up to 20%. Together, these changes reflect Cloudflare's effort to optimize its infrastructure for high-quality machine learning inference while adapting to new technologies and models.
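
The summary does not say how Cloudflare implements token-aware load balancing, so the following is only a minimal sketch of the general idea: route each request to the server with the least outstanding token work rather than the fewest requests. The names `Replica`, `TokenAwareBalancer`, `route`, and `complete` are hypothetical and do not come from the post.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical inference server tracked by the balancer."""
    name: str
    pending_tokens: int = 0  # estimated prompt tokens queued or in flight


@dataclass
class TokenAwareBalancer:
    """Illustrative only; not Cloudflare's actual implementation."""
    replicas: list[Replica] = field(default_factory=list)

    def route(self, prompt_tokens: int) -> Replica:
        # Choose the replica with the least outstanding token work.
        # Counting requests alone would treat a 100k-token prompt
        # the same as a 500-token one.
        target = min(self.replicas, key=lambda r: r.pending_tokens)
        target.pending_tokens += prompt_tokens
        return target

    def complete(self, replica: Replica, prompt_tokens: int) -> None:
        # Release the load estimate once the request has finished.
        replica.pending_tokens = max(0, replica.pending_tokens - prompt_tokens)


balancer = TokenAwareBalancer([Replica("a"), Replica("b")])
first = balancer.route(100_000)  # long, high-context prompt
second = balancer.route(500)     # short prompt
third = balancer.route(500)      # another short prompt
```

In this example the long prompt lands on replica `a`, and both short prompts go to replica `b`, because `b` still has far fewer pending tokens even though it is serving more requests. A request-count balancer would instead have sent the third request back to `a`, behind the long prefill.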