FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Blog post from Together AI
FlashAttention-4 is an algorithm and kernel co-design that optimizes attention on Blackwell B200 GPUs, specifically targeting the challenge of asymmetric hardware scaling: tensor core throughput has grown much faster than the other on-chip resources attention depends on. Through new pipelining and scheduling techniques, FlashAttention-4 overlaps matrix multiplications with the other bottlenecks, notably the softmax exponential computation and shared-memory traffic, reaching up to 1605 TFLOPS on B200 with BF16 and outperforming both cuDNN and Triton. Key innovations include a software emulation of the exponential function that relieves the special-function-unit bottleneck, a new tile scheduler for load balancing, and a backward-pass design that cuts shared-memory traffic and avoids atomic operations, enabling deterministic execution. Implemented in CuTe-DSL, FlashAttention-4 delivers significant speedups on attention benchmarks, especially at large sequence lengths, while sustaining high utilization of the Blackwell architecture.
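The software emulation of the exponential replaces the GPU's special-function-unit instruction with ordinary multiply-add arithmetic, which the much faster general-purpose pipelines can absorb. The blog post does not give the exact polynomial, but the standard technique is to split the exponent into integer and fractional parts, approximate 2^f on [0, 1) with a low-degree polynomial evaluated by fused multiply-adds, and reconstruct the result by scaling with the integer exponent. A minimal Python sketch of that idea, with illustrative (not FlashAttention-4's actual) coefficients:

```python
import math

def exp2_poly(x: float) -> float:
    """Approximate 2**x with FMA-style arithmetic instead of a
    special-function-unit instruction. Illustrative sketch only:
    the polynomial degree and coefficients are assumptions, not
    the ones used in FlashAttention-4."""
    n = math.floor(x)    # integer part -> handled as an exact power of two
    f = x - n            # fractional part in [0, 1)
    # Degree-3 polynomial approximation of 2**f on [0, 1),
    # evaluated in Horner form (one multiply-add per coefficient).
    p = 1.0 + f * (0.6951937 + f * (0.2266267 + f * 0.0781011))
    return math.ldexp(p, n)   # exact scaling by 2**n
```

In a real kernel the same structure maps onto a handful of FFMA instructions plus exponent-bit manipulation, so the exponentials needed by softmax no longer serialize behind the special function unit and can be overlapped with the tensor-core matmuls.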