FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
The proposed FlashAttention algorithm improves the efficiency of Transformers by making attention IO-aware: it accounts for reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, using tiling so that the full attention matrix never has to be materialized in HBM. This reduces the number of memory accesses required by standard attention implementations, resulting in a 15% end-to-end wall-clock speedup when training BERT-large. FlashAttention also enables longer context in Transformers, yielding higher-quality models with better perplexity and downstream accuracy, as well as entirely new capabilities such as better-than-chance performance on long-range sequence classification benchmarks (Path-X). The authors further extend FlashAttention to block-sparse attention, producing an approximate attention algorithm that is faster than existing approximate attention methods.
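
To make the tiling idea concrete, the following is a minimal NumPy sketch of block-wise attention with online softmax rescaling, the core numerical trick that lets exact attention be computed one key/value block at a time without storing the full N x N score matrix. It is purely illustrative: the function name, block size, and pure-NumPy formulation are assumptions for exposition, not the paper's fused CUDA kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed over K/V blocks with online softmax.

    Only one block of scores is materialized at a time, mirroring how an
    IO-aware kernel keeps its working set in on-chip SRAM instead of
    writing the full score matrix to HBM. Illustrative sketch only.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(N)           # running softmax denominator per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                 # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))

        # Rescale previously accumulated output and denominator,
        # then fold in the contribution of the current block.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])

        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The output matches standard softmax attention exactly (up to floating-point error); the savings come entirely from restructuring the memory access pattern, not from approximation.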