FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Blog post from Together AI
FlashAttention-4 is an algorithm and kernel co-design that optimizes attention on Blackwell B200 GPUs, specifically targeting the challenge of asymmetric hardware scaling: tensor core throughput has grown much faster than the other on-chip resources attention depends on. Through new pipelining and scheduling techniques, FlashAttention-4 overlaps matrix multiplications with the other bottlenecks, notably the softmax exponential computation and shared-memory traffic, reaching up to 1605 TFLOPS on B200 with BF16 and outperforming both cuDNN and Triton. Key innovations include a software emulation of the exponential function that relieves the special-function-unit bottleneck, a new tile scheduler for load balancing, and a backward-pass design that cuts shared-memory traffic and avoids atomic operations, enabling deterministic execution. Implemented in CuTe-DSL, FlashAttention-4 delivers significant speedups on attention benchmarks, especially at large sequence lengths, while sustaining high utilization of the Blackwell architecture.
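The software emulation of the exponential replaces the GPU's special-function-unit instruction with ordinary multiply-add arithmetic, which the much faster general-purpose pipelines can absorb. The blog post does not give the exact polynomial, but the standard technique is to split the exponent into integer and fractional parts, approximate 2^f on [0, 1) with a low-degree polynomial evaluated by fused multiply-adds, and reconstruct the result by scaling with the integer exponent. A minimal Python sketch of that idea, with illustrative (not FlashAttention-4's actual) coefficients:

```python
import math

def exp2_poly(x: float) -> float:
    """Approximate 2**x with FMA-style arithmetic instead of a
    special-function-unit instruction. Illustrative sketch only:
    the polynomial degree and coefficients are assumptions, not
    the ones used in FlashAttention-4."""
    n = math.floor(x)    # integer part -> handled as an exact power of two
    f = x - n            # fractional part in [0, 1)
    # Degree-3 polynomial approximation of 2**f on [0, 1),
    # evaluated in Horner form (one multiply-add per coefficient).
    p = 1.0 + f * (0.6951937 + f * (0.2266267 + f * 0.0781011))
    return math.ldexp(p, n)   # exact scaling by 2**n
```

In a real kernel the same structure maps onto a handful of FFMA instructions plus exponent-bit manipulation, so the exponentials needed by softmax no longer serialize behind the special function unit and can be overlapped with the tensor-core matmuls.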