
Serving DeepSeek-V4: why million-token context is an inference systems problem

Blog post from Together AI

Post Details

Company: Together AI
Date Published:
Author:
Word Count: 2,573
Language: English
Hacker News Points: -
Summary

DeepSeek-V4 turns million-token context processing into a serving-systems problem. Its hybrid attention design compresses context before it reaches key-value (KV) storage, combining Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Sliding Window Attention (SWA). Together these mechanisms cut KV-cache pressure, which is the binding constraint for long-context, decode-heavy workloads such as coding and research agents.

The reduced cache footprint enables denser batching, broader prefix reuse, and simpler memory management, allowing NVIDIA HGX B200 platforms to hold compressed cache layouts for many concurrent requests. Because V4 mixes several cache types, each with its own memory-management strategy, operators must evaluate cache policies and endpoint profiles against their specific workloads.

While the architecture promises better serving efficiency for long-context tasks, realizing those gains requires careful benchmarking and tuning across workload regimes, particularly when migrating from short-chat applications to those requiring extensive context handling.
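To make the KV-cache pressure argument concrete, here is a back-of-envelope sizing sketch. It is not DeepSeek-V4's actual architecture: the layer counts, head dimensions, window size, compression ratio, and layer split below are all assumed round numbers, chosen only to show why a hybrid layout changes per-request memory at million-token context.

```python
# Illustrative KV-cache sizing sketch. All model shapes below are
# ASSUMED for illustration, not DeepSeek-V4's real configuration.

def kv_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
             bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V caches of one sequence (fp16/bf16 elements)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

CONTEXT = 1_000_000                       # million-token context
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128   # assumed model shape

# Dense baseline: every layer caches every token.
dense = kv_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)

# Hypothetical hybrid split: most layers use a sliding window or a
# compressed representation, so cached tokens per layer shrink sharply.
SWA_WINDOW = 8_192        # sliding-window layers keep a fixed window
COMPRESS_RATIO = 16       # compressed layers keep ~1/16 of tokens
swa_layers, compressed_layers, dense_layers = 40, 16, 4

hybrid = (kv_bytes(SWA_WINDOW, swa_layers, KV_HEADS, HEAD_DIM)
          + kv_bytes(CONTEXT // COMPRESS_RATIO, compressed_layers,
                     KV_HEADS, HEAD_DIM)
          + kv_bytes(CONTEXT, dense_layers, KV_HEADS, HEAD_DIM))

print(f"dense  : {dense / 2**30:.1f} GiB per request")
print(f"hybrid : {hybrid / 2**30:.1f} GiB per request")
print(f"saving : {dense / hybrid:.1f}x")
```

Under these assumptions the dense cache is roughly an order of magnitude larger per request than the hybrid one, which is exactly the headroom that enables the denser batching and prefix reuse described above.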