Company
Date Published
Author
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Chris Ré
Word count
2303
Language
English
Hacker News points
165

Summary

"Based: Simple linear attention language models balance the recall-throughput tradeoff." Based is a simple, efficient architecture that combines sliding window attention and linear attention to deliver high-quality language modeling with strong associative recall. At inference time, Based decodes without a KV-cache, enabling up to 24x higher throughput than Transformers with FlashAttention-2. Using just two well-known, attention-like building blocks, sliding window attention (with tiny window sizes) and linear attention (with a Taylor series approximation of exp(QK^T)), Based outperforms the strongest prior sub-quadratic architectures on language modeling, real-world recall-intensive tasks, and in-context learning, while offering fast generation. The choice of featurization matters, and the Taylor map is surprisingly simple yet effective. IO-aware algorithms for the Taylor linear attention forward pass and for inference reduce data movement between slow HBM and fast SRAM, which is what unlocks the up to 24x higher next-token prediction throughput over FlashAttention-2 and makes Based a promising architecture for language modeling and other applications.
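To make the linear-attention half of the recipe concrete, the sketch below shows a second-order Taylor feature map for exp(q . k) and a single recurrent decoding step in PyTorch. This is a minimal illustration of the ideas in the summary, not the post's IO-aware kernels or the full Based block (the sliding window component is omitted); the function names, shapes, and the feature dimension of 16 are assumptions made for the example.

```python
import torch

def taylor_exp_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Second-order Taylor expansion of exp(q . k) as a feature map, so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2. `x` holds projected queries or
    keys with a small feature dimension (assumed here, e.g. 16)."""
    # Flattened outer product supplies the (q.k)^2 / 2 term after the dot product.
    x2 = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    ones = torch.ones(*x.shape[:-1], 1, dtype=x.dtype, device=x.device)
    return torch.cat([ones, x, x2], dim=-1)


def linear_attention_decode_step(state, z, q_t, k_t, v_t):
    """One generation step of causal linear attention written as a recurrence.
    A fixed-size running state (d_feat x d_v) and normalizer z replace the
    growing KV-cache, so per-token cost stays constant in sequence length.
      state: (batch, d_feat, d_v)   z: (batch, d_feat)
      q_t, k_t: (batch, d_in)       v_t: (batch, d_v)"""
    phi_q, phi_k = taylor_exp_feature_map(q_t), taylor_exp_feature_map(k_t)
    state = state + phi_k.unsqueeze(-1) * v_t.unsqueeze(-2)  # rank-1 update
    z = z + phi_k
    y_t = torch.einsum("bf,bfv->bv", phi_q, state)
    y_t = y_t / (phi_q * z).sum(dim=-1, keepdim=True)
    return y_t, state, z


# Toy usage: with d_in = 16 the feature dimension is 1 + 16 + 16*16 = 273.
batch, d_in, d_v = 2, 16, 64
d_feat = 1 + d_in + d_in * d_in
state = torch.zeros(batch, d_feat, d_v)
z = torch.zeros(batch, d_feat)
for _ in range(8):  # stand-in for generating 8 tokens
    q_t, k_t = torch.randn(batch, d_in), torch.randn(batch, d_in)
    v_t = torch.randn(batch, d_v)
    y_t, state, z = linear_attention_decode_step(state, z, q_t, k_t, v_t)
```

Because the state is a fixed-size matrix rather than a cache that grows with every generated token, the per-step decode cost is independent of sequence length, which is where the throughput advantage over KV-cache decoding comes from.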