Building the foundation for running extra-large language models

Post Details

Company

Cloudflare

Date Published

April 16, 2026

Author

Michelle Chen, Kevin Flansburg, and Vlad Krasnov

Word Count

1,997

Company Posts That Month

43

Language

English

Hacker News Points

-

Post removed?

No

Source URL

blog.cloudflare.com/high-performance-llms

Summary

Cloudflare's recent work on hosting large language models such as Moonshot's Kimi K2.5 through Workers AI has made agentic use cases faster and more efficient to serve. The architecture uses prefill-decode disaggregation: the prefill and decode stages run on different servers, which optimizes GPU usage and improves performance. Token-aware load balancing and prompt caching increase throughput and decrease latency, particularly in high-context scenarios such as AI code reviews. Cloudflare uses Mooncake's Transfer Engine and Store to manage the KV cache across multiple GPUs, and extends the cache beyond VRAM onto NVMe storage to handle increased traffic. Speculative decoding with NVIDIA's EAGLE-3 model speeds up token generation while maintaining quality. Infire, Cloudflare's proprietary inference engine written in Rust, supports multi-GPU configurations, reduces memory overhead, achieves faster cold starts, and boosts tokens-per-second throughput by up to 20%. Together, these changes reflect Cloudflare's continued effort to optimize its infrastructure for high-quality machine learning inference as new technologies and models emerge.
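The token-aware load balancing mentioned in the summary can be illustrated with a minimal sketch. It assumes the balancer tracks an estimated count of pending tokens per backend and routes each request to the backend with the fewest, rather than the one with the fewest requests. The names `Backend`, `TokenAwareBalancer`, `route`, and `complete` are hypothetical and chosen for illustration; this is not Cloudflare's implementation, which the summary does not describe in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical inference server tracked by the balancer."""
    name: str
    pending_tokens: int = 0  # estimated tokens queued or in flight


@dataclass
class TokenAwareBalancer:
    """Illustrative balancer: weighs load by tokens, not by request count."""
    backends: list[Backend] = field(default_factory=list)

    def route(self, prompt_tokens: int) -> Backend:
        # Choose the backend with the smallest pending token load.
        # On a tie, min() returns the first backend in the list.
        target = min(self.backends, key=lambda b: b.pending_tokens)
        target.pending_tokens += prompt_tokens
        return target

    def complete(self, backend: Backend, prompt_tokens: int) -> None:
        # Release the tokens once the request finishes; never go below zero.
        backend.pending_tokens = max(0, backend.pending_tokens - prompt_tokens)


balancer = TokenAwareBalancer([Backend("gpu-a"), Backend("gpu-b")])
first = balancer.route(10_000)   # one very long prompt
second = balancer.route(100)     # short prompt
third = balancer.route(100)      # another short prompt
```

In this sketch the single long prompt occupies `gpu-a`, so both short prompts go to `gpu-b`. A balancer that only counted requests would have sent the third request to `gpu-a`, queuing it behind the long prompt.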

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	6	5,932	1,046	223	-2%
Real-time	1	6,296	1,346	246	-2%
