TFLOPS Gap: Why FP4 MoE Kernel Engineering Matters on Blackwell
Blog post from Hugging Face
The article examines why kernel engineering is decisive for the performance of FP4 MoE (Mixture of Experts) models on NVIDIA's Blackwell B200 GPU, which supports FP4 natively. Benchmarking three MoE backends (vLLM, SGLang, and FlashInfer CuteDSL) shows that SGLang reaches up to a 3.54x speedup over BF16 and 1.32x over vLLM at batch size 1, underscoring the impact of kernel fusion, Blackwell-specific optimization, and adaptive grid sizing.

The key optimizations are reducing the number of memory passes, using Blackwell-specific CUTLASS schedules, and maximizing SM occupancy. Together, these deliver significant throughput and latency gains, especially at the small batch sizes that matter most for interactive inference applications such as chatbots.

The takeaway is that hardware support for FP4 is necessary but not sufficient: the format's full potential is realized only through kernel engineering tailored to Blackwell's unique features, and frameworks that prioritize such optimization are likely to lead future performance benchmarks.
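To make "adaptive grid sizing" and "maximizing SM occupancy" concrete, here is a minimal, framework-agnostic sketch of the persistent-kernel idea the summary alludes to: rather than launching one CTA (thread block) per output tile, the launcher caps the grid at one full wave of CTAs across the SMs and lets each CTA loop over several tiles, so no SM idles on a partial trailing wave. All names below are hypothetical, and the SM count used in the example is illustrative, not a B200 specification.

```python
# Toy illustration (not any framework's actual code) of occupancy-driven
# grid sizing for a persistent tiled kernel.

def pick_grid_size(num_tiles: int, num_sms: int, ctas_per_sm: int = 1) -> int:
    """Return the number of CTAs to launch for `num_tiles` units of work.

    Launch at most one "full wave" (num_sms * ctas_per_sm CTAs); each CTA
    then processes multiple tiles instead of exiting after one.
    """
    full_wave = num_sms * ctas_per_sm
    return min(num_tiles, full_wave)

def tiles_for_cta(cta_id: int, grid: int, num_tiles: int) -> list[int]:
    """Static round-robin assignment of tiles to one persistent CTA."""
    return list(range(cta_id, num_tiles, grid))

# Example: 1000 tiles on a hypothetical 148-SM GPU.
grid = pick_grid_size(1000, 148)
assert grid == 148  # one full wave; each CTA loops over ~7 tiles
assert sum(len(tiles_for_cta(c, grid, 1000)) for c in range(grid)) == 1000
```

The design point this sketches: a fixed one-CTA-per-tile launch wastes SMs whenever the tile count is not a multiple of the machine's wave size, which is exactly the regime (small batches, hence few tiles) where the article reports the largest FP4 speedups.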