TFLOPS Gap: Why FP4 MoE Kernel Engineering Matters on Blackwell
Blog post from Hugging Face
The article examines why kernel engineering is decisive for the performance of FP4 MoE (Mixture of Experts) models on NVIDIA's Blackwell B200 GPU, which supports FP4 natively. Benchmarking three MoE backends (vLLM, SGLang, and FlashInfer CuteDSL) shows that SGLang reaches up to a 3.54x speedup over BF16 and 1.32x over vLLM at batch size 1, underscoring the impact of kernel fusion, Blackwell-specific optimization, and adaptive grid sizing.

The key optimizations are reducing the number of memory passes, using Blackwell-specific CUTLASS schedules, and maximizing SM occupancy. Together, these deliver significant throughput and latency gains, especially at the small batch sizes that matter most for interactive inference applications such as chatbots.

The takeaway is that hardware support for FP4 is necessary but not sufficient: the format's full potential is realized only through kernel engineering tailored to Blackwell's unique features, and frameworks that prioritize such optimization are likely to lead future performance benchmarks.
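To make "adaptive grid sizing" and "maximizing SM occupancy" concrete, here is a minimal, framework-agnostic sketch of the persistent-kernel idea the summary alludes to: rather than launching one CTA (thread block) per output tile, the launcher caps the grid at one full wave of CTAs across the SMs and lets each CTA loop over several tiles, so no SM idles on a partial trailing wave. All names below are hypothetical, and the SM count used in the example is illustrative, not a B200 specification.

```python
# Toy illustration (not any framework's actual code) of occupancy-driven
# grid sizing for a persistent tiled kernel.

def pick_grid_size(num_tiles: int, num_sms: int, ctas_per_sm: int = 1) -> int:
    """Return the number of CTAs to launch for `num_tiles` units of work.

    Launch at most one "full wave" (num_sms * ctas_per_sm CTAs); each CTA
    then processes multiple tiles instead of exiting after one.
    """
    full_wave = num_sms * ctas_per_sm
    return min(num_tiles, full_wave)

def tiles_for_cta(cta_id: int, grid: int, num_tiles: int) -> list[int]:
    """Static round-robin assignment of tiles to one persistent CTA."""
    return list(range(cta_id, num_tiles, grid))

# Example: 1000 tiles on a hypothetical 148-SM GPU.
grid = pick_grid_size(1000, 148)
assert grid == 148  # one full wave; each CTA loops over ~7 tiles
assert sum(len(tiles_for_cta(c, grid, 1000)) for c in range(grid)) == 1000
```

The design point this sketches: a fixed one-CTA-per-tile launch wastes SMs whenever the tile count is not a multiple of the machine's wave size, which is exactly the regime (small batches, hence few tiles) where the article reports the largest FP4 speedups.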