Company:
Date Published:
Author: Stuart
Word count: 4763
Language: English
Hacker News points: None

Summary

The blog post describes a major optimization for training large language models on Blackwell GPUs (B200s) using custom MXFP8 kernels, yielding a 1.5x overall training speedup over the previous Hopper (H100) setup. After identifying the Mixture-of-Experts (MoE) layer as the key bottleneck, the team rewrote it from scratch at the GPU kernel level with minimal dependencies, in pure CUDA and PTX. They adopted microscaling with MXFP8 data formats to cut computational cost, achieving a 3.5x speedup in the MoE layer with almost no loss in training quality. The move to block-scaled matrix multiplication and a highly optimized MXFP8 quantization kernel were crucial to extracting full performance from Blackwell, overcoming challenges posed by its new tensor memory architecture and by quantization overhead. Together these changes let their stack outperform open-source alternatives, paving the way for faster MoE training and future exploration of FP4 precision.
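To make the microscaling idea concrete, below is a minimal, hypothetical CUDA sketch of MXFP8-style quantization; it is not the kernel from the post, whose actual implementation is far more heavily optimized. Following the OCP MX layout, each block of 32 values shares one power-of-two (E8M0) scale derived from the block's maximum magnitude, and the scaled values are narrowed to FP8 E4M3. The kernel name, launch shape, and the floor(log2(amax)) - 8 exponent recipe are illustrative assumptions.

```cuda
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative MXFP8-style quantizer (not the post's kernel):
// one MX block = 32 elements sharing a single power-of-two (E8M0) scale,
// with the scaled elements stored as FP8 E4M3. One thread per MX block.
__global__ void mxfp8_quantize(const float* in, __nv_fp8_e4m3* out,
                               unsigned char* scales, int n) {
    int blk  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = blk * 32;
    if (base >= n) return;

    // Largest magnitude in the block.
    float amax = 0.0f;
    for (int i = 0; i < 32 && base + i < n; ++i)
        amax = fmaxf(amax, fabsf(in[base + i]));

    // Shared exponent: floor(log2(amax)) minus E4M3's max exponent (8),
    // clamped to the E8M0 range and stored with a bias of 127.
    int e = (amax > 0.0f) ? (int)floorf(log2f(amax)) - 8 : -127;
    e = max(-127, min(127, e));
    scales[blk] = (unsigned char)(e + 127);

    // Scale and narrow each element; the FP8 constructor saturates to
    // the finite E4M3 range, matching MX saturation behaviour.
    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32 && base + i < n; ++i)
        out[base + i] = __nv_fp8_e4m3(in[base + i] * inv_scale);
}

int main() {
    const int n = 1 << 20;                 // 1M floats -> 32K MX blocks
    const int nblocks = (n + 31) / 32;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = sinf(0.001f * i);

    float* d_in; __nv_fp8_e4m3* d_out; unsigned char* d_scales;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(__nv_fp8_e4m3));
    cudaMalloc(&d_scales, nblocks);
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    mxfp8_quantize<<<(nblocks + 255) / 256, 256>>>(d_in, d_out, d_scales, n);
    cudaDeviceSynchronize();
    printf("quantized %d values into %d MX blocks\n", n, nblocks);
    return 0;
}
```

The quantized tensor plus its per-block scales is what a block-scaled matmul on Blackwell consumes: the tensor cores multiply the FP8 elements and apply the shared exponents during accumulation, which is why keeping the quantization step cheap matters so much for end-to-end MoE throughput.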