FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
The proposed FlashAttention algorithm improves the efficiency of Transformers by making attention IO-aware: it accounts for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, using tiling so that the full attention matrix never has to be materialized in HBM. This reduces the number of memory accesses required by standard attention implementations, resulting in a 15% end-to-end wall-clock speedup when training BERT-large. FlashAttention also enables longer context in Transformers, yielding higher-quality models with better perplexity and downstream accuracy, as well as entirely new capabilities such as better-than-chance performance on long-range sequence classification benchmarks (Path-X). The authors further extend FlashAttention to block-sparse attention, producing an approximate attention algorithm that is faster than existing approximate attention methods.
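
To make the tiling idea concrete, the following is a minimal NumPy sketch of block-wise attention with online softmax rescaling, the core numerical trick that lets exact attention be computed one key/value block at a time without storing the full N x N score matrix. It is purely illustrative: the function name, block size, and pure-NumPy formulation are assumptions for exposition, not the paper's fused CUDA kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed over K/V blocks with online softmax.

    Only one block of scores is materialized at a time, mirroring how an
    IO-aware kernel keeps its working set in on-chip SRAM instead of
    writing the full score matrix to HBM. Illustrative sketch only.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                 # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))

        # Rescale previously accumulated output and denominator,
        # then fold in the contribution of the current block.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])

        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The output matches standard softmax attention exactly (up to floating-point error); the savings come entirely from restructuring the memory access pattern, not from approximation.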