Author
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher RĂ©
Word count
347
Language
English
Hacker News points
None

Summary

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

The proposed FlashAttention algorithm improves the efficiency of Transformers by making attention IO-aware, accounting for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. By tiling the computation, it reduces the number of memory accesses required by standard attention, yielding a 15% end-to-end wall-clock speedup when training BERT-large. FlashAttention enables longer context in Transformers, producing higher-quality models with better perplexity and accuracy, as well as entirely new capabilities, such as the first better-than-chance performance on long-range classification tasks like Path-X. A block-sparse extension of the algorithm improves efficiency further, giving an approximate attention method that is faster than existing approximate attention baselines.
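
To make the tiling idea concrete, the sketch below is a minimal NumPy rendering of exact attention computed block by block with an online softmax, so the full N x N score matrix is never materialized at once. It is only an illustration of the principle, not the paper's fused CUDA kernel; the function name, block size, and the sanity check against naive attention are assumptions made for this example.

import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    # Illustrative sketch: exact softmax attention computed over K/V blocks,
    # keeping only running statistics instead of the full score matrix.
    N, d = Q.shape
    O = np.zeros((N, d))                      # running (unnormalized) output
    m = np.full(N, -np.inf)                   # running row-wise max of scores
    l = np.zeros(N)                           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]      # one key block (stand-in for an on-chip tile)
        Vb = V[start:start + block_size]      # matching value block
        S = Q @ Kb.T / np.sqrt(d)             # scores against this block only

        m_new = np.maximum(m, S.max(axis=1))  # update running max
        P = np.exp(S - m_new[:, None])        # block-local softmax numerator
        scale = np.exp(m - m_new)             # rescale previously accumulated sums
        l = scale * l + P.sum(axis=1)
        O = scale[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]                     # final normalization

# Sanity check against the naive formulation that materializes all scores.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)

The output matches naive attention exactly (up to floating-point error) because the online softmax rescales earlier partial sums whenever a new block raises the running maximum; the memory saving comes from never forming the full score matrix in slow memory.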