Company
Date Published
Author
Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov
Word count
1271
Language
English
Hacker News points
None

Summary

Flash-Decoding is a technique that significantly speeds up attention during inference for large language models. It works by splitting the keys and values into smaller chunks, computing the attention of the query against each split in parallel using FlashAttention, and then combining the partial results in a final reduction. This approach unlocks up to 8x faster generation for very long sequences and scales much better than alternative approaches.
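
To make the split-and-reduce idea concrete, here is a minimal PyTorch sketch for a single query token. It is not the authors' kernel: the function name `flash_decoding_attention` and the `num_splits` parameter are illustrative, and the real implementation runs each split as a parallel FlashAttention call on the GPU rather than a Python loop. The key point it demonstrates is that keeping each split's log-sum-exp lets the partial outputs be combined exactly into full softmax attention.

```python
import torch

def flash_decoding_attention(q, k, v, num_splits=4):
    """Split-KV attention for one query token.

    q: (d,) query; k, v: (seq_len, d) keys and values.
    Each split is attended to independently (in parallel in the real
    kernel), then a final reduction merges them using each split's
    log-sum-exp so the result equals full softmax attention.
    """
    d = q.shape[-1]
    scale = d ** -0.5
    k_splits = k.chunk(num_splits, dim=0)
    v_splits = v.chunk(num_splits, dim=0)

    partial_outs, partial_lse = [], []
    for k_i, v_i in zip(k_splits, v_splits):
        scores = (k_i @ q) * scale                   # (chunk_len,)
        m = scores.max()                             # for numerical stability
        p = torch.exp(scores - m)
        partial_outs.append((p @ v_i) / p.sum())     # split-local attention output
        partial_lse.append(m + torch.log(p.sum()))   # log-sum-exp of split's scores

    # Final reduction: weight each split's output by its share of softmax mass.
    weights = torch.softmax(torch.stack(partial_lse), dim=0)
    return sum(w * o for w, o in zip(weights, partial_outs))
```

A quick check that the reduction is exact, not approximate:

```python
torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=0) @ v
assert torch.allclose(flash_decoding_attention(q, k, v), ref, atol=1e-5)
```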