
BASED: Simple linear attention language models balance the recall-throughput tradeoff

Blog post from Together AI

Post Details

Company: Together AI
Date Published:
Author: Simran, Sabri, Michael, Aman, Silas, Dylan, James, Atri, Chris
Word Count: 2,303
Language: English
Hacker News Points: 165
Summary

Based, a simple and efficient architecture, combines sliding window attention and linear attention to offer high-quality language modeling with strong associative recall capabilities. At inference time, Based decodes without a KV-cache, enabling a 24x throughput improvement over Transformers with FlashAttention-2!

The Based architecture outperforms prior sub-quadratic models on real-world recall-intensive tasks and in-context learning, while offering fast generation speeds. By using just two well-known, attention-like building blocks, sliding window attention (with tiny window sizes) and linear attention (with a Taylor series approximation of exp(QK^T)), we can outperform the strongest sub-quadratic architectures on language modeling and achieve massive speedups over optimized Transformers! The choice of featurization matters as well: the Taylor map is surprisingly simple yet effective.

IO-aware algorithms for the Taylor linear attention forward pass and inference reduce data movement between slow HBM and fast SRAM, unlocking further efficiency. Altogether, Based achieves up to 24x higher throughput than FlashAttention-2 in next-token prediction, making it a promising architecture for language modeling and other applications.
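To make the first building block concrete, here is a minimal PyTorch sketch of the sliding-window piece: causal attention in which each position attends only to a small local window. This is an illustrative dense-mask version, not Based's IO-aware kernel (which avoids materializing the full score matrix), and the default `window=64` is a placeholder rather than a setting from the post.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 64):
    """Causal attention where each position attends only to itself and the
    previous `window - 1` positions. Shapes: q, k, v are (..., n, d)."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)   # i - j for each query i, key j
    mask = (diff >= 0) & (diff < window)         # causal AND within the window
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With a tiny window, this term captures precise local token interactions cheaply, leaving long-range recall to the linear attention term.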
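The linear attention piece hinges on the featurization. One natural instantiation, sketched below under the assumption of a second-order expansion exp(q·k) ≈ 1 + q·k + (q·k)^2 / 2, is the feature map phi(x) = [1, x, (x ⊗ x)/√2], whose inner products reproduce exactly those three terms; the function name and normalization details here are illustrative.

```python
import torch

def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Map x of shape (..., d) to features of dimension 1 + d + d^2 such that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, the 2nd-order Taylor series of exp(q.k)."""
    ones = torch.ones(*x.shape[:-1], 1, dtype=x.dtype, device=x.device)
    # Flattened outer product x (x) x, scaled so the second-order inner product
    # comes out to (q.k)^2 / 2.
    x2 = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / 2 ** 0.5
    return torch.cat([ones, x, x2], dim=-1)

# Sanity check against the Taylor expansion it is meant to reproduce.
q, k = torch.randn(16), torch.randn(16)
lhs = taylor_feature_map(q) @ taylor_feature_map(k)
rhs = 1 + q @ k + (q @ k) ** 2 / 2
assert torch.allclose(lhs, rhs, atol=1e-4)
```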
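Because attention becomes linear in phi(k), decoding can carry a fixed-size running state instead of a KV-cache that grows with sequence length, which is where the throughput win comes from. A minimal sketch of this recurrent view (names and the numerical guard are illustrative, not Together AI's implementation):

```python
import torch

def decode_step(state, z, q_t, k_t, v_t, feature_map):
    """One token of linear-attention decoding in recurrent form.
    state: (d_feat, d_v) running sum of phi(k_i) v_i^T
    z:     (d_feat,)     running sum of phi(k_i)
    Memory is constant in sequence length -- there is no KV-cache."""
    phi_q, phi_k = feature_map(q_t), feature_map(k_t)
    state = state + phi_k.unsqueeze(-1) * v_t.unsqueeze(-2)   # rank-1 state update
    z = z + phi_k
    y_t = (phi_q @ state) / (phi_q @ z).clamp(min=1e-6)       # guard the denominator
    return y_t, state, z

# Usage with the Taylor map sketched above: d = 16 gives d_feat = 1 + 16 + 16**2 = 273.
d, d_v = 16, 16
state = torch.zeros(1 + d + d**2, d_v)
z = torch.zeros(1 + d + d**2)
q_t, k_t, v_t = torch.randn(d), torch.randn(d), torch.randn(d_v)
y_t, state, z = decode_step(state, z, q_t, k_t, v_t, taylor_feature_map)
```

Each step costs the same regardless of how many tokens precede it, so generation memory and bandwidth stay flat while a Transformer's KV-cache keeps growing.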