Consistency diffusion language models: Up to 14x faster inference without sacrificing quality
Blog post from Together AI
Diffusion Language Models (DLMs) offer a promising alternative to traditional autoregressive language models: they generate tokens in parallel and leverage bidirectional context for tasks such as text infilling and refinement. In practice, however, standard DLMs are inefficient on two fronts. Full bidirectional attention is incompatible with KV caching, so past activations must be recomputed at every step, and preserving output quality requires many refinement steps, each of which is computationally expensive.

CDLM addresses both bottlenecks with a training-based acceleration scheme. It enforces temporal consistency within each block, which reduces the number of refinement steps needed without significant loss of accuracy, and it uses a block-wise causal attention mask, which enables exact KV caching because the keys and values of completed blocks never change.

Together, these changes yield substantial latency and throughput improvements, making CDLM particularly effective on math and coding tasks: fewer steps and faster inference at competitive accuracy. A system-level analysis further shows that CDLM strikes a balance between computational intensity and memory use, making it an efficient choice for small-batch settings.
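To make the caching argument concrete, here is a minimal sketch of a block-wise causal attention mask. The function name, block size, and NumPy representation are illustrative assumptions, not the CDLM implementation: tokens attend bidirectionally within their own block and causally to all earlier blocks, so a finalized block's keys and values are fixed and can be cached exactly.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build a block-wise causal attention mask (True = may attend).

    Hypothetical helper for illustration: position i may attend to
    position j iff j's block is not later than i's block, giving
    full bidirectional attention inside a block and causal attention
    across blocks.
    """
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]

# With seq_len=6 and block_size=2, each 2-token block sees itself
# fully plus all earlier blocks; once a block is finalized, its
# keys/values never change, so they can be reused across steps.
mask = block_causal_mask(seq_len=6, block_size=2)
```

Under a fully bidirectional mask, by contrast, every token attends to every other token, so any token update invalidates all cached keys and values, which is why standard DLMs cannot use an exact KV cache.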