
Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Doctor Shotgun and Geechan
Word Count: 2,508
Language: -
Hacker News Points: -
Summary

The article explains how to run local mixture-of-experts (MoE) models across both CPU and GPU resources, focusing on llama.cpp and its fork, ik_llama.cpp. MoE models such as DeepSeek V3 and GLM 4.X have very large total parameter counts, but only a fraction of those parameters are active during each forward pass. To exploit this, the guide recommends offloading the "always active" parameters to the GPU while keeping the routed expert parameters in system RAM for the CPU. It covers techniques for optimizing weight offloading and prompt processing, emphasizing the role of batch sizes and VRAM management, and highlights specific flags and commands for multi-GPU setups, including graph mode split and NUMA optimizations for multi-socket CPUs. It closes with advanced configurations and performance testing using ik_llama.cpp, which is designed for improved CPU/CUDA hybrid performance.
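As a rough illustration of the offloading strategy the summary describes, a llama.cpp server launch along these lines keeps the always-active weights on the GPU and the routed experts on the CPU. This is a minimal sketch, not the article's exact command: the model filename and batch sizes are placeholders, and flag details differ between llama.cpp and ik_llama.cpp.

```shell
# Hypothetical invocation; adjust model path, batch sizes, and thread
# counts for your hardware.
llama-server \
  -m ./GLM-4.5-Air-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  -ot "exps=CPU" \
  -b 4096 -ub 4096 \
  --numa distribute
```

Here `--n-gpu-layers 999` requests that all layers go to the GPU, and the `-ot` (`--override-tensor`) pattern then overrides that for tensors whose names contain `exps` (the routed expert weights), pinning them to the CPU buffer; larger `-b`/`-ub` batch sizes help prompt processing, and `--numa distribute` spreads threads across NUMA nodes on multi-socket machines.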