Run Gemma 4 on Intel® Xeon® Out-Of-the-Box

Blog post from HuggingFace

Post Details
Company: HuggingFace
Author: Jiang Li, Xinyu Chen, Chendi Xue, FanZhao, Yi Wang, Wuxun Zhang, Alex Gu, Xinyi Li, jianan, Yintong Lu, and Matrix Yao
Word Count: 1,464
Summary

Intel® Xeon® CPUs with Advanced Matrix Extensions (AMX) offer a cost-effective way to run AI inference for small to medium-sized models. AMX accelerates inference for the BF16 and INT8 data types, which makes Xeon a viable option for enterprises that already operate Xeon servers. Intel's collaboration with open-source communities has produced kernel optimizations and feature enhancements in AI frameworks such as PyTorch, Hugging Face transformers, vLLM, and SGLang, so these frameworks work on Xeon without extra integration effort. Gemma 4 models, which combine sliding-window and full attention, run efficiently on Xeon CPUs: vLLM serves them through its built-in CPUAttention backend, and Hugging Face transformers relies on PyTorch kernels. The Gemma4MoE, Vision Tower, and Audio Tower components are optimized for Intel® Xeon® CPUs using upstreamed FusedMoE kernels. Setting up the environment for vLLM and Hugging Face transformers involves Docker and Python configuration. Once set up, the models handle tasks across text, image, and audio inputs, such as captioning, and tensor parallelism is available for serving larger models.
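Because the summary ties the speedup to AMX, it can help to confirm that a given Xeon server actually exposes AMX before setting anything up. The following is a minimal sketch, not taken from the original post: it assumes a Linux host, where the kernel lists the feature flags `amx_tile`, `amx_bf16`, and `amx_int8` in `/proc/cpuinfo`. The helper names `parse_cpu_flags` and `amx_support` are hypothetical and chosen for illustration only.

```python
# Linux kernel names for the AMX feature flags as shown in /proc/cpuinfo.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text.

    Hypothetical helper: returns an empty set if no 'flags' line is found.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def amx_support(cpuinfo_text: str) -> dict:
    """Map each AMX flag name to whether it appears in the cpuinfo text."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in AMX_FLAGS}
```

On a Linux machine, a typical call would be `amx_support(open("/proc/cpuinfo").read())`. If `amx_bf16` or `amx_int8` is reported as missing, the BF16 or INT8 acceleration described above is likely unavailable on that host, although the models may still run on slower code paths.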