
Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell

Blog post from Fal

Post Details
Company: Fal
Date Published:
Author: Yigithan Yigit
Word Count: 1,676
Language: English
Hacker News Points: -
Summary

The MXFP8 quantizer, written in CuTeDSL, reaches over 6 TB/s of bandwidth on the B200. It writes scale factors directly into the packed layout that Blackwell's block-scaled Tensor Cores expect, so no separate packing step is needed. MXFP8 is a microscaling format: each 32-element block shares one power-of-two scale exponent, and the values themselves are stored as FP8 E4M3. Key optimizations include splitting the workload over K to increase parallelism, using the Tensor Memory Accelerator (TMA) to move data from HBM to SMEM, and packing scale bytes into wider storage units so that stores are used more efficiently. Together these changes resolved the initial problems with CTA mapping and store utilization, and they keep effective bandwidth high while producing output in the memory layout that block-scaled GEMMs consume.
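To make the block-scaling arithmetic concrete, here is a minimal CPU reference sketch in plain Python. It is not the post's CuTeDSL kernel. It assumes the floor-based shared exponent from the OCP Microscaling (MX) specification, with saturation at the E4M3 maximum of 448, and the kernel may pick its exponent differently. The function names (`round_to_e4m3`, `quantize_row`, `pack_scales`) are hypothetical, and `pack_scales` only illustrates combining four scale bytes into one 32-bit word; it does not reproduce the tiled scale-factor layout that Blackwell expects.

```python
import math

BLOCK = 32        # elements sharing one scale (MX block size)
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude (1.75 * 2**8)
E4M3_EMAX = 8     # exponent of the largest E4M3 binade
E8M0_BIAS = 127   # bias of the 8-bit power-of-two scale exponent


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) = e - 1.
    exp = max(math.frexp(abs(x))[1] - 1, -6)  # -6 covers the subnormal range
    step = math.ldexp(1.0, exp - 3)           # 3 mantissa bits per binade
    return round(x / step) * step             # round() breaks ties to even


def quantize_block(block):
    """Return (scale_byte, quantized_values) for one 32-element block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        shared_exp = 0  # assumption: an all-zero block gets a scale of 2**0
    else:
        # Assumed OCP MX rule: floor(log2(amax)) minus the element format's emax.
        shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
        shared_exp = max(-127, min(127, shared_exp))
    quantized = [round_to_e4m3(math.ldexp(v, -shared_exp)) for v in block]
    return shared_exp + E8M0_BIAS, quantized


def quantize_row(row):
    """Quantize a row whose length is a multiple of BLOCK."""
    assert len(row) % BLOCK == 0
    scales, values = [], []
    for start in range(0, len(row), BLOCK):
        scale_byte, quantized = quantize_block(row[start:start + BLOCK])
        scales.append(scale_byte)
        values.extend(quantized)
    return scales, values


def pack_scales(scale_bytes):
    """Pack each group of 4 scale bytes into one little-endian 32-bit word."""
    assert len(scale_bytes) % 4 == 0
    return [
        int.from_bytes(bytes(scale_bytes[i:i + 4]), "little")
        for i in range(0, len(scale_bytes), 4)
    ]
```

Under these assumptions, a quantized value `q` with scale byte `s` dequantizes to `q * 2**(s - 127)`. Packing four scale bytes per word means one 32-bit store replaces four single-byte stores, which is the kind of store-efficiency gain the summary describes.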