Company:
Date Published:
Author: Stuart
Word count: 4763
Language: English
Hacker News points: None

Summary

The blog post describes a major optimization for training large language models on Blackwell GPUs (B200s) using custom MXFP8 kernels, yielding a 1.5x overall training speedup over the previous Hopper (H100) setup. After identifying the Mixture-of-Experts (MoE) layer as the key bottleneck, the team rewrote it from scratch at the GPU kernel level with minimal dependencies, in pure CUDA and PTX. They adopted microscaling with MXFP8 data formats to cut computational cost, achieving a 3.5x speedup in the MoE layer with almost no loss in training quality. The move to block-scaled matrix multiplication and a highly optimized MXFP8 quantization kernel were crucial to extracting full performance from Blackwell, overcoming challenges posed by its new tensor memory architecture and by quantization overhead. Together these changes let their stack outperform open-source alternatives, paving the way for faster MoE training and future exploration of FP4 precision.
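To make the microscaling idea concrete, below is a minimal, hypothetical CUDA sketch of MXFP8-style quantization; it is not the kernel from the post, whose actual implementation is far more heavily optimized. Following the OCP MX layout, each block of 32 values shares one power-of-two (E8M0) scale derived from the block's maximum magnitude, and the scaled values are narrowed to FP8 E4M3. The kernel name, launch shape, and the floor(log2(amax)) - 8 exponent recipe are illustrative assumptions.

```cuda
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative MXFP8-style quantizer (not the post's kernel):
// one MX block = 32 elements sharing a single power-of-two (E8M0) scale,
// with the scaled elements stored as FP8 E4M3. One thread per MX block.
__global__ void mxfp8_quantize(const float* in, __nv_fp8_e4m3* out,
                               unsigned char* scales, int n) {
    int blk  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = blk * 32;
    if (base >= n) return;

    // Largest magnitude in the block.
    float amax = 0.0f;
    for (int i = 0; i < 32 && base + i < n; ++i)
        amax = fmaxf(amax, fabsf(in[base + i]));

    // Shared exponent: floor(log2(amax)) minus E4M3's max exponent (8),
    // clamped to the E8M0 range and stored with a bias of 127.
    int e = (amax > 0.0f) ? (int)floorf(log2f(amax)) - 8 : -127;
    e = max(-127, min(127, e));
    scales[blk] = (unsigned char)(e + 127);

    // Scale and narrow each element; the FP8 constructor saturates to
    // the finite E4M3 range, matching MX saturation behaviour.
    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32 && base + i < n; ++i)
        out[base + i] = __nv_fp8_e4m3(in[base + i] * inv_scale);
}

int main() {
    const int n = 1 << 20;                 // 1M floats -> 32K MX blocks
    const int nblocks = (n + 31) / 32;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = sinf(0.001f * i);

    float* d_in; __nv_fp8_e4m3* d_out; unsigned char* d_scales;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(__nv_fp8_e4m3));
    cudaMalloc(&d_scales, nblocks);
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    mxfp8_quantize<<<(nblocks + 255) / 256, 256>>>(d_in, d_out, d_scales, n);
    cudaDeviceSynchronize();
    printf("quantized %d values into %d MX blocks\n", n, nblocks);
    return 0;
}
```

The quantized tensor plus its per-block scales is what a block-scaled matmul on Blackwell consumes: the tensor cores multiply the FP8 elements and apply the shared exponents during accumulation, which is why keeping the quantization step cheap matters so much for end-to-end MoE throughput.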