Company: Together
Date Published
Author
Word count: 2001
Language: English
Hacker News points: None

Summary

Tri Dao, Chief Scientist at Together AI, has released FlashAttention-2, an algorithm that speeds up training and inference of large language models by up to 4x, reaching 72% model FLOPs utilization on NVIDIA A100 GPUs. The new version is built from scratch using primitives from NVIDIA's CUTLASS 3.x and its core library, CuTe, which provide clean abstractions and powerful building blocks for maximum speed. FlashAttention-2 is about 2x faster than the previous implementation, reaching up to 230 TFLOPs/s on A100 GPUs, and is available as open source on GitHub. The algorithm works with existing models and can be used for training, fine-tuning, and inference of large language models. These improvements enable models with twice the context length while maintaining an interactive experience, a significant advance for natural language processing.
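To make concrete what FlashAttention-2 accelerates: it computes exactly the standard attention output, softmax(QKᵀ/√d)·V, but in a fused, tiled GPU kernel that avoids materializing the full attention matrix. A minimal NumPy sketch of that reference computation (the math, not the fused kernel, and all names here are illustrative) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; FlashAttention
    # maintains this running max per tile as it streams over keys.
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Reference attention: softmax(Q K^T / sqrt(d)) V.

    FlashAttention-2 produces the same output, but tiles Q, K, V
    through fast on-chip SRAM instead of building the full
    (seq_len x seq_len) score matrix in GPU memory.
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # (batch, seq, seq)
    return softmax(scores) @ v                    # (batch, seq, d)

# Example: batch of 2 sequences, length 4, head dimension 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = attention(q, k, v)  # shape (2, 4, 8)
```

Because the attention weights in each row sum to 1, the memory cost of the naive version grows quadratically with sequence length; eliminating that quadratic intermediate is what lets FlashAttention-2 double usable context length at the same hardware budget.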