How to Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience

Post Details

Company

Modular

Date Published

Jan. 14, 2026

Author

David Robertson

Word Count

1,750

Language

English

Hacker News Points

-

Source URL

www.modular.com/blog/how-to-beat-unsloth-s-cuda-kernel-using-mojo-with-zero-gpu-experience

Summary

David Robertson, a member of the Mojo community, describes how he tackled a GPU kernel challenge with no prior GPU experience and outperformed existing C++/CUDA implementations. He took on the Unsloth NF4 dequantization puzzle and beat the reference time using Mojo, a programming language with Python-like syntax designed to simplify GPU programming. After a slow start, Robertson used AI tools such as ChatGPT, together with systematic experimentation, to optimize his kernel, ultimately achieving significant speedups on several GPUs, including the Tesla T4 and L4. The process highlighted the importance of hardware-specific optimizations and showed how Mojo's straightforward approach made rapid testing and iteration possible, suggesting it could make GPU programming more accessible to those who find traditional tools complex.