Company
Date Published
Author
Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov
Word count
1271
Language
English
Hacker News points
None

Summary

Flash-Decoding is a technique that significantly speeds up attention during inference for large language models. It works by splitting the keys and values into smaller chunks, computing the attention of the query against each split in parallel using FlashAttention, and then combining the partial results in a final reduction. This approach unlocks up to 8x faster generation for very long sequences and scales much better than alternative approaches.
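
To make the split-and-reduce idea concrete, here is a minimal PyTorch sketch for a single query token. It is not the authors' kernel: the function name `flash_decoding_attention` and the `num_splits` parameter are illustrative, and the real implementation runs each split as a parallel FlashAttention call on the GPU rather than a Python loop. The key point it demonstrates is that keeping each split's log-sum-exp lets the partial outputs be combined exactly into full softmax attention.

```python
import torch

def flash_decoding_attention(q, k, v, num_splits=4):
    """Split-KV attention for one query token.

    q: (d,) query; k, v: (seq_len, d) keys and values.
    Each split is attended to independently (in parallel in the real
    kernel), then a final reduction merges them using each split's
    log-sum-exp so the result equals full softmax attention.
    """
    d = q.shape[-1]
    scale = d ** -0.5
    k_splits = k.chunk(num_splits, dim=0)
    v_splits = v.chunk(num_splits, dim=0)

    partial_outs, partial_lse = [], []
    for k_i, v_i in zip(k_splits, v_splits):
        scores = (k_i @ q) * scale                   # (chunk_len,)
        m = scores.max()                             # for numerical stability
        p = torch.exp(scores - m)
        partial_outs.append((p @ v_i) / p.sum())     # split-local attention output
        partial_lse.append(m + torch.log(p.sum()))   # log-sum-exp of split's scores

    # Final reduction: weight each split's output by its share of softmax mass.
    weights = torch.softmax(torch.stack(partial_lse), dim=0)
    return sum(w * o for w, o in zip(weights, partial_outs))
```

A quick check that the reduction is exact, not approximate:

```python
torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=0) @ v
assert torch.allclose(flash_decoding_attention(q, k, v), ref, atol=1e-5)
```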