
Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Doctor Shotgun and Geechan
Word Count: 2,508
Language: -
Hacker News Points: -
Summary

The article explains how to run local mixture-of-experts (MoE) models across both CPU and GPU resources, focusing on llama.cpp and its fork, ik_llama.cpp. MoE models such as DeepSeek V3 and GLM 4.X have very large total parameter counts, but only a fraction of those parameters are active during each forward pass. To exploit this, the guide recommends offloading the "always active" parameters to the GPU while keeping the routed expert parameters in system RAM for the CPU. It covers techniques for optimizing weight offloading and prompt processing, emphasizing the role of batch sizes and VRAM management, and highlights specific flags and commands for multi-GPU setups, including graph mode split and NUMA optimizations for multi-socket CPUs. It closes with advanced configurations and performance testing using ik_llama.cpp, which is designed for improved CPU/CUDA hybrid performance.
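As a rough illustration of the offloading strategy the summary describes, a llama.cpp server launch along these lines keeps the always-active weights on the GPU and the routed experts on the CPU. This is a minimal sketch, not the article's exact command: the model filename and batch sizes are placeholders, and flag details differ between llama.cpp and ik_llama.cpp.

```shell
# Hypothetical invocation; adjust model path, batch sizes, and thread
# counts for your hardware.
llama-server \
  -m ./GLM-4.5-Air-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  -ot "exps=CPU" \
  -b 4096 -ub 4096 \
  --numa distribute
```

Here `--n-gpu-layers 999` requests that all layers go to the GPU, and the `-ot` (`--override-tensor`) pattern then overrides that for tensors whose names contain `exps` (the routed expert weights), pinning them to the CPU buffer; larger `-b`/`-ub` batch sizes help prompt processing, and `--numa distribute` spreads threads across NUMA nodes on multi-socket machines.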