Run Gemma 4 on Intel® Xeon® Out-Of-the-Box

Post Details

Company

Hugging Face

Date Published

April 1, 2026

Author

Jiang Li, Xinyu Chen, Chendi Xue, FanZhao, Yi Wang, Wuxun Zhang, Alex Gu, Xinyi Li, jianan, Yintong Lu, and Matrix Yao

Word Count

1,464

Company Posts That Month

61

Language

-

Hacker News Points

-

Post removed?

No

Source URL

huggingface.co/blog/MatrixYao/xeon

Summary

Intel® Xeon® CPUs with Advanced Matrix Extensions (AMX) are a cost-effective option for AI inference on small to medium-sized models. AMX accelerates inference for the BF16 and INT8 data types, which makes Xeon a viable choice for enterprises that already run Xeon servers. Intel's collaboration with open-source communities has produced kernel optimizations and feature enhancements in AI frameworks such as PyTorch, Hugging Face transformers, vLLM, and SGLang, so these frameworks run on Xeon out of the box. Gemma 4 models, which combine sliding-window and full attention, run on Xeon CPUs through vLLM's built-in CPUAttention backend and through PyTorch kernels in Hugging Face transformers. The Gemma4MoE, Vision Tower, and Audio Tower components are optimized for Intel® Xeon® CPUs using upstreamed FusedMoE kernels. Setting up the environment for vLLM and Hugging Face transformers involves Docker and Python configuration, after which the models can handle text, image, and audio inputs for tasks such as captioning, with tensor parallelism available for larger models.
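
The summary mentions serving with vLLM and using tensor parallelism for larger models. This page does not reproduce the post's actual commands, so what follows is only a minimal sketch of how such a launch command might be assembled. The helper name build_vllm_serve_command and the model ID are hypothetical placeholders, and the sketch assumes a CPU-enabled vLLM installation whose CLI accepts the --dtype and --tensor-parallel-size options.

```python
import shlex


def build_vllm_serve_command(model_id, tensor_parallel_size=1, dtype="bfloat16"):
    """Assemble an argument list for `vllm serve` (hypothetical helper).

    Assumes a CPU-enabled vLLM install; BF16 is chosen because the summary
    names it as one of the data types that AMX accelerates.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model_id,
        "--dtype", dtype,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# Placeholder model ID: substitute the checkpoint named in the original post.
MODEL_ID = "your-org/your-gemma-4-checkpoint"

command = build_vllm_serve_command(MODEL_ID, tensor_parallel_size=2)
print(shlex.join(command))
```

The printed line can be pasted into a shell inside the container or Python environment where vLLM is installed. Raising the tensor-parallel size shards the model across more workers, which is the option the summary describes for larger models.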
