Flash-Decoding is a technique that significantly speeds up attention during inference for large language models, delivering up to 8x faster generation for very long sequences. It works by splitting the keys and values into smaller chunks, computing the attention of the query against each split in parallel using FlashAttention, and then combining the partial results in a final reduction step. Because the chunks are processed independently, this approach scales much better with sequence length than alternative approaches.
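To make the split-and-reduce step concrete, here is a minimal PyTorch sketch of the idea for a single query vector. This is an illustration of the underlying math, not the actual fused CUDA kernel: the sequential loop stands in for chunks that the real kernel processes in parallel, and the function name and `num_splits` parameter are hypothetical.

```python
import torch

def flash_decoding_attention(q, k, v, num_splits=4):
    # q: (d,) query vector for one decoding step
    # k, v: (seq_len, d) cached keys and values
    seq_len, d = k.shape
    scale = d ** -0.5
    chunk_maxes, chunk_sums, chunk_outs = [], [], []
    # Step 1: each KV chunk computes a partial softmax independently.
    # In real Flash-Decoding these chunks run in parallel on the GPU.
    for k_chunk, v_chunk in zip(k.chunk(num_splits), v.chunk(num_splits)):
        scores = (k_chunk @ q) * scale       # (chunk_len,) attention logits
        m = scores.max()                     # chunk-local max, for stability
        p = torch.exp(scores - m)            # unnormalized probabilities
        chunk_maxes.append(m)
        chunk_sums.append(p.sum())           # chunk-local softmax denominator
        chunk_outs.append(p @ v_chunk)       # unnormalized chunk output, (d,)
    # Step 2: final reduction, combining the partial softmaxes
    # via a log-sum-exp rescaling to the global max.
    m_global = torch.stack(chunk_maxes).max()
    denom, out = torch.zeros(()), torch.zeros(d)
    for m, s, o in zip(chunk_maxes, chunk_sums, chunk_outs):
        w = torch.exp(m - m_global)          # rescale chunk to the global max
        denom = denom + w * s
        out = out + w * o
    return out / denom

# Sanity check against a plain softmax-attention reference:
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=0) @ v
assert torch.allclose(flash_decoding_attention(q, k, v), ref, atol=1e-5)
```

The key property the sketch demonstrates is that softmax can be computed chunk-by-chunk and merged exactly afterward, which is what lets Flash-Decoding parallelize across the KV cache even when there is only a single query token.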