Chasing 6+ TB/s: an MXFP8 quantizer on Blackwell

Post Details

Company

Fal

Date Published

Jan. 27, 2026

Author

Yigithan Yigit

Word Count

1,676

Language

English

Source URL

blog.fal.ai/chasing-6-tb-s-an-mxfp8-quantizer-on-blackwell

Summary

The post describes an MXFP8 quantizer written in CuTeDSL that reaches over 6 TB/s on the B200. It does so by writing scale factors directly into the packed layout that Blackwell's block-scaled Tensor Cores expect, which removes the need for a separate packing step. MXFP8 is a microscaling format: each 32-element block shares one power-of-two scale exponent, and the values themselves are stored as FP8 E4M3. Three optimizations account for most of the speedup: splitting the workload over K to increase parallelism, using the Tensor Memory Accelerator (TMA) to move data from HBM to SMEM, and packing scale bytes into larger storage units so that stores are more efficient. Together these changes resolved the initial problems with CTA mapping and store utilization, and the kernel keeps its high effective bandwidth while still producing the memory layout that block-scaled GEMMs require.
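The format described in the summary can be illustrated with a small NumPy reference. This is a minimal sketch and not the post's CuTeDSL kernel: the function names (`round_to_e4m3`, `quantize_mxfp8`, `pack_scales_u32`) are hypothetical, the scale rule (floor of log2 of the block maximum, minus the E4M3 maximum exponent, stored with a bias of 127) is an assumption based on the usual microscaling convention, and the packing shown is a plain four-bytes-per-word grouping rather than the Tensor Core scale-factor layout the post targets.

```python
import numpy as np

BLOCK = 32        # elements that share one scale (from the summary)
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8     # 448 = 1.75 * 2**8
E8M0_BIAS = 127   # assumed bias for the 8-bit power-of-two scale exponent


def round_to_e4m3(x):
    """Round float32 values to the nearest E4M3-representable value, saturating at 448."""
    x = np.asarray(x, dtype=np.float32)
    mag = np.abs(x)
    _, e = np.frexp(mag)                # mag = m * 2**e with m in [0.5, 1)
    exp = np.maximum(e - 1, -6)         # floor(log2(mag)), clamped to the subnormal range
    quantum = np.ldexp(np.float32(1.0), exp - 3)  # 3 mantissa bits: 8 steps per binade
    rounded = np.round(mag / quantum) * quantum   # np.round breaks ties to even
    return (np.sign(x) * np.minimum(rounded, E4M3_MAX)).astype(np.float32)


def quantize_mxfp8(x):
    """Quantize a (rows, K) float32 array, K a multiple of 32.

    Returns the E4M3-rounded values (kept as float32 for inspection) and one
    biased scale-exponent byte per 32-element block.
    """
    x = np.asarray(x, dtype=np.float32)
    rows, k = x.shape
    assert k % BLOCK == 0, "K must be a multiple of the block size"
    blocks = x.reshape(rows, k // BLOCK, BLOCK)
    amax = np.abs(blocks).max(axis=-1)
    _, e = np.frexp(amax)
    # Assumed rule: shared exponent = floor(log2(amax)) - emax of the element format.
    scale_exp = np.clip((e - 1) - E4M3_EMAX, -E8M0_BIAS, E8M0_BIAS)
    scales_u8 = (scale_exp + E8M0_BIAS).astype(np.uint8)
    scaled = blocks / np.ldexp(np.float32(1.0), scale_exp)[..., None]
    return round_to_e4m3(scaled).reshape(rows, k), scales_u8


def pack_scales_u32(scales_u8):
    """Group every 4 consecutive scale bytes of a row into one little-endian uint32."""
    assert scales_u8.shape[-1] % 4 == 0, "need a multiple of 4 scales per row"
    return np.ascontiguousarray(scales_u8).view("<u4")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 256)).astype(np.float32)
    q, scales = quantize_mxfp8(x)
    packed = pack_scales_u32(scales)
    print(q.shape, scales.shape, packed.shape)
```

A block is dequantized by multiplying its values by `2.0 ** (scale_byte - 127)`. Packing four scale bytes into one 32-bit word mirrors the idea of issuing fewer, wider stores; the real kernel additionally has to place those words where the block-scaled GEMM expects them, which this sketch does not attempt.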