Company: Together
Date Published
Author
Word count: 2001
Language: English
Hacker News points: None

Summary

Tri Dao, Chief Scientist at Together AI, has released FlashAttention-2, an algorithm that speeds up training and inference of large language models by up to 4x, reaching 72% model FLOPs utilization on NVIDIA A100 GPUs. The new version is built from scratch using primitives from NVIDIA's CUTLASS 3.x and its core library, CuTe, which provide clean abstractions and powerful building blocks for maximum speed. FlashAttention-2 is about 2x faster than the previous implementation, reaching up to 230 TFLOPs/s on A100 GPUs, and is available as open source on GitHub. The algorithm works with existing models and can be used for training, fine-tuning, and inference of large language models. These improvements enable models with twice the context length while maintaining an interactive experience, a significant advance for natural language processing.
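To make concrete what FlashAttention-2 accelerates: it computes exactly the standard attention output, softmax(QKᵀ/√d)·V, but in a fused, tiled GPU kernel that avoids materializing the full attention matrix. A minimal NumPy sketch of that reference computation (the math, not the fused kernel, and all names here are illustrative) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; FlashAttention
    # maintains this running max per tile as it streams over keys.
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Reference attention: softmax(Q K^T / sqrt(d)) V.

    FlashAttention-2 produces the same output, but tiles Q, K, V
    through fast on-chip SRAM instead of building the full
    (seq_len x seq_len) score matrix in GPU memory.
    """
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)  # (batch, seq, seq)
    return softmax(scores) @ v                    # (batch, seq, d)

# Example: batch of 2 sequences, length 4, head dimension 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = attention(q, k, v)  # shape (2, 4, 8)
```

Because the attention weights in each row sum to 1, the memory cost of the naive version grows quadratically with sequence length; eliminating that quadratic intermediate is what lets FlashAttention-2 double usable context length at the same hardware budget.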