
Easily Build and Share ROCm Kernels with Hugging Face

Blog post from Hugging Face

Post Details

Company: Hugging Face
Author: Abdennacer Badaoui, Daniel Huang, colorswind, and Zesen Liu
Word Count: 3,120
Summary

Custom kernels are essential for high-performance deep learning: they let GPU operations be tailored to specific workloads, such as image processing or tensor transformations. Compiling these kernels for different architectures and integrating them into PyTorch extensions can be challenging, but Hugging Face's kernel-builder and kernels libraries simplify the process by supporting multiple GPU backends, including ROCm for AMD GPUs. This guide covers creating, testing, and sharing ROCm-compatible kernels, using the RadeonFlow GEMM kernel as an example. That kernel is optimized for the AMD Instinct MI300X GPU and uses the low-precision FP8 format to increase throughput and reduce memory-bandwidth usage, while preserving accuracy through per-block scaling. The guide explains how to structure projects, configure build files, and register custom kernels as native PyTorch operators, leveraging tools like Nix for reproducible builds. Once built, these kernels can be shared on the Hugging Face Hub, making them readily available to the community.
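To make the per-block scaling idea concrete, here is a minimal pure-Python sketch. It is illustrative only and is not the RadeonFlow implementation (which runs in HIP on the GPU): each block of values gets its own scale derived from the block's maximum magnitude and the FP8 E4M3 representable maximum (~448), so even a narrow storage format keeps relative error small per block. The rounding step below is a crude stand-in for an actual FP8 cast.

```python
# Illustrative sketch of per-block scaling for low-precision storage.
# Assumption: FP8 E4M3 max finite magnitude ~= 448 (per the common spec);
# the rounding below only emulates precision loss, it is not a real FP8 cast.

FP8_E4M3_MAX = 448.0

def quantize_block(block):
    """Quantize one block of floats; return (quantized values, scale)."""
    amax = max(abs(v) for v in block) or 1.0
    # Choose the scale so the block's largest magnitude maps near the
    # format's maximum representable value.
    scale = amax / FP8_E4M3_MAX
    quantized = [round(v / scale) for v in block]  # crude stand-in for FP8
    return quantized, scale

def dequantize_block(quantized, scale):
    """Recover approximate original values using the stored block scale."""
    return [q * scale for q in quantized]

values = [0.5, -3.2, 100.0, 0.01]
q, s = quantize_block(values)
restored = dequantize_block(q, s)
# With per-block scaling, the worst-case absolute error within a block
# stays bounded by about half the block's scale.
max_err = max(abs(a - b) for a, b in zip(values, restored))
```

In a real FP8 GEMM, these per-block scales are carried alongside the quantized tiles and applied during accumulation, which is why accuracy holds up despite the format's limited dynamic range.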