
BASED: Simple linear attention language models balance the recall-throughput tradeoff

Blog post from Together AI

Post Details

Company: Together AI
Date Published:
Author: Simran, Sabri, Michael, Aman, Silas, Dylan, James, Atri, Chris
Word Count: 2,303
Language: English
Hacker News Points: 165
Summary

Based, a simple and efficient architecture, combines sliding window attention and linear attention to offer high-quality language modeling with strong associative recall capabilities. At inference time, Based decodes without a KV-cache, enabling a 24x throughput improvement over Transformers with FlashAttention-2!

The Based architecture outperforms prior sub-quadratic models on real-world recall-intensive tasks and in-context learning, while offering fast generation speeds. By using just two well-known, attention-like building blocks, sliding window attention (with tiny window sizes) and linear attention (with a Taylor series approximation of exp(QK^T)), we can outperform the strongest sub-quadratic architectures on language modeling and achieve massive speedups over optimized Transformers! The choice of featurization matters as well: the Taylor map is surprisingly simple yet effective.

IO-aware algorithms for the Taylor linear attention forward pass and inference reduce data movement between slow HBM and fast SRAM, unlocking further efficiency. Altogether, Based achieves up to 24x higher throughput than FlashAttention-2 in next-token prediction, making it a promising architecture for language modeling and other applications.
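To make the first building block concrete, here is a minimal PyTorch sketch of the sliding-window piece: causal attention in which each position attends only to a small local window. This is an illustrative dense-mask version, not Based's IO-aware kernel (which avoids materializing the full score matrix), and the default `window=64` is a placeholder rather than a setting from the post.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 64):
    """Causal attention where each position attends only to itself and the
    previous `window - 1` positions. Shapes: q, k, v are (..., n, d)."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    diff = idx.unsqueeze(1) - idx.unsqueeze(0)   # i - j for each query i, key j
    mask = (diff >= 0) & (diff < window)         # causal AND within the window
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With a tiny window, this term captures precise local token interactions cheaply, leaving long-range recall to the linear attention term.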
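The linear attention piece hinges on the featurization. One natural instantiation, sketched below under the assumption of a second-order expansion exp(q·k) ≈ 1 + q·k + (q·k)^2 / 2, is the feature map phi(x) = [1, x, (x ⊗ x)/√2], whose inner products reproduce exactly those three terms; the function name and normalization details here are illustrative.

```python
import torch

def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Map x of shape (..., d) to features of dimension 1 + d + d^2 such that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, the 2nd-order Taylor series of exp(q.k)."""
    ones = torch.ones(*x.shape[:-1], 1, dtype=x.dtype, device=x.device)
    # Flattened outer product x (x) x, scaled so the second-order inner product
    # comes out to (q.k)^2 / 2.
    x2 = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / 2 ** 0.5
    return torch.cat([ones, x, x2], dim=-1)

# Sanity check against the Taylor expansion it is meant to reproduce.
q, k = torch.randn(16), torch.randn(16)
lhs = taylor_feature_map(q) @ taylor_feature_map(k)
rhs = 1 + q @ k + (q @ k) ** 2 / 2
assert torch.allclose(lhs, rhs, atol=1e-4)
```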
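Because attention becomes linear in phi(k), decoding can carry a fixed-size running state instead of a KV-cache that grows with sequence length, which is where the throughput win comes from. A minimal sketch of this recurrent view (names and the numerical guard are illustrative, not Together AI's implementation):

```python
import torch

def decode_step(state, z, q_t, k_t, v_t, feature_map):
    """One token of linear-attention decoding in recurrent form.
    state: (d_feat, d_v) running sum of phi(k_i) v_i^T
    z:     (d_feat,)     running sum of phi(k_i)
    Memory is constant in sequence length -- there is no KV-cache."""
    phi_q, phi_k = feature_map(q_t), feature_map(k_t)
    state = state + phi_k.unsqueeze(-1) * v_t.unsqueeze(-2)   # rank-1 state update
    z = z + phi_k
    y_t = (phi_q @ state) / (phi_q @ z).clamp(min=1e-6)       # guard the denominator
    return y_t, state, z

# Usage with the Taylor map sketched above: d = 16 gives d_feat = 1 + 16 + 16**2 = 273.
d, d_v = 16, 16
state = torch.zeros(1 + d + d**2, d_v)
z = torch.zeros(1 + d + d**2)
q_t, k_t, v_t = torch.randn(d), torch.randn(d), torch.randn(d_v)
y_t, state, z = decode_step(state, z, q_t, k_t, v_t, taylor_feature_map)
```

Each step costs the same regardless of how many tokens precede it, so generation memory and bandwidth stay flat while a Transformer's KV-cache keeps growing.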