Company:
Date Published:
Author: Jiqing Feng, Matrix Yao, Ke Ding, and Ilyas Moutawwakil
Word count: 1374
Language: -
Hacker News points: None

Summary

Intel and Hugging Face have collaborated to demonstrate significant cost-efficiency and performance gains for large Mixture of Experts (MoE) models, such as OpenAI's GPT OSS, by upgrading to Google Cloud's C4 virtual machines powered by Intel Xeon 6 processors. The C4 VMs delivered a 1.7x improvement in total cost of ownership (TCO) and a 1.4x to 1.7x increase in throughput per vCPU compared with the previous-generation C3 VMs. These gains came from targeted optimizations, including directing expert execution so that redundant computation across the MoE layer is avoided. Benchmark tests, which measured steady-state decoding throughput across a range of batch sizes, confirmed that the new setup provides both higher throughput and lower latency, making large-scale MoE model inference more practical on general-purpose CPUs.
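The expert-execution optimization mentioned above can be illustrated with a minimal sketch: instead of running the MoE layer token by token, tokens routed to the same expert are grouped so each active expert's weights are applied once per batch, and inactive experts are skipped entirely. This is a simplified, hypothetical illustration of the general technique, not the actual GPT OSS or Intel/Hugging Face implementation; all names and shapes here are illustrative.

```python
# Hypothetical sketch of grouped MoE expert dispatch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_experts = 8, 4, 3
x = rng.standard_normal((n_tokens, d_model))

# Router: pick the top-1 expert per token (real MoE layers often use top-k).
router_logits = rng.standard_normal((n_tokens, n_experts))
assignment = router_logits.argmax(axis=1)

# One weight matrix per expert (square for simplicity).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

out = np.empty_like(x)
for e in range(n_experts):
    idx = np.nonzero(assignment == e)[0]
    if idx.size == 0:
        continue  # inactive experts are skipped, saving compute
    # One matmul covers every token routed to expert e,
    # instead of a separate matmul per token.
    out[idx] = x[idx] @ experts[e]
```

Grouping turns many small per-token matrix-vector products into a few larger matrix-matrix multiplies, which is exactly the shape of work that CPU matmul kernels (e.g. AMX on Xeon 6) execute most efficiently.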