The research paper discusses the design and advantages of "Based," a new recurrent architecture developed to address the recall-memory tradeoff in language models. Based outperforms other efficient architectures such as Mamba and RWKV on recall-intensive tasks while maintaining fast generation speeds, achieving 24× higher throughput than FlashAttention-2. The architecture combines two attention primitives: sliding window attention for local token interactions and linear attention for global interactions, using a Taylor series approximation of the exponential (softmax) kernel so that the recurrent state stays a fixed size. This combination lets Based traverse the Pareto frontier of the recall-memory tradeoff effectively. Despite its simplicity, Based shows strong performance on real-world language modeling tasks and curated recall-intensive benchmarks, although it still lags behind the strongest Transformer baselines in some cases. The research emphasizes the importance of efficient hardware usage and introduces new IO-aware algorithms and a CUDA DSL called ThunderKittens to enhance Based's performance.
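
To illustrate the linear-attention half of this design, the sketch below shows how a second-order Taylor expansion of exp(q·k) ≈ 1 + q·k + (q·k)²/2 yields a feature map whose causal attention can be computed as a recurrence with a fixed-size state. This is a minimal NumPy sketch of the general technique, not the paper's optimized IO-aware kernels; the helper names (`taylor_feature_map`, `linear_attention_recurrent`) are illustrative, not from the paper's codebase.

```python
import numpy as np

def taylor_feature_map(x):
    # x: (seq_len, d). Map each row to [1, x, vec(x x^T)/sqrt(2)] so that
    # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, the 2nd-order Taylor expansion of exp(q.k).
    n, d = x.shape
    ones = np.ones((n, 1))
    second = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, second], axis=-1)  # (n, 1 + d + d^2)

def linear_attention_recurrent(q, k, v):
    # Causal linear attention as a recurrence: S accumulates phi(k_t) v_t^T and
    # z accumulates phi(k_t), so each generation step costs O(1) in sequence length
    # and the state size is fixed at (1 + d + d^2) x d_v regardless of context length.
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)
    n, d_v = v.shape
    d_f = fq.shape[-1]
    S = np.zeros((d_f, d_v))   # fixed-size "KV" state
    z = np.zeros(d_f)          # normalizer state
    out = np.zeros_like(v)
    for t in range(n):
        S += np.outer(fk[t], v[t])
        z += fk[t]
        out[t] = fq[t] @ S / (fq[t] @ z + 1e-6)
    return out

# Toy usage with small random inputs.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16)) / 4
k = rng.normal(size=(8, 16)) / 4
v = rng.normal(size=(8, 16))
print(linear_attention_recurrent(q, k, v).shape)  # (8, 16)
```

In the full architecture, this global linear-attention state is paired with a small sliding-window attention over recent tokens for precise local interactions, and the expanded feature dimension (1 + d + d²) is what determines the fixed recurrent state size that governs the recall-memory tradeoff.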