
Mixture of Experts (MoEs) in Transformers

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Aritra Roy Gosthipaty, Pedro Cuenca, merve, Ilyas Moutawwakil, Arthur Zucker, Sergio Paniego, and Pablo Montalvo
Word Count: 2,054
Language: -
Hacker News Points: -
Summary

Mixture of Experts (MoE) models in Transformers represent a significant shift from dense to sparse architectures, offering computational efficiency and scalability through learnable sub-networks, or "experts," that are selectively activated for different tokens. This allows model capacity to grow without a proportional increase in inference cost, as in gpt-oss-20b, which activates only a small subset of its experts per token.

Supporting MoEs in Transformers required significant changes to the model loading and execution pipeline: a dynamic weight loading system built around a WeightConverter packs expert weights into a single tensor for efficient runtime execution, and an Experts Backend system lets users select among execution strategies based on workload requirements.

Expert Parallelism lets these models scale beyond single-device constraints by distributing experts across multiple devices, using components such as GroupedGemmParallel and RouterParallel for efficient computation. Training MoEs, while complex due to their massive parameter counts and routing instabilities, is optimized through a collaboration with Unsloth that yields significant speed and memory-efficiency improvements. The Transformers library continues to evolve to support these sparse architectures, integrating new abstractions and backend optimizations to ease the development and deployment of advanced MoE models.
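The sparse activation the summary describes comes down to a learned router picking a few experts per token. The following is a minimal NumPy sketch of top-k routing, not the Transformers implementation; all names (`moe_forward`, `router_w`, `expert_ws`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    (n_tokens, d_model)
    router_w:  (d_model, n_experts) router projection
    expert_ws: (n_experts, d_model, d_model), one weight matrix per expert
    """
    logits = tokens @ router_w                  # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k highest-scoring experts
    # Renormalize the selected logits so the k gate weights sum to 1 per token.
    gates = softmax(np.take_along_axis(logits, topk, axis=-1))
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for slot in range(k):
            e = topk[t, slot]
            out[t] += gates[t, slot] * (tokens[t] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 4, 8, 6
y, chosen = moe_forward(rng.normal(size=(n_tokens, d)),
                        rng.normal(size=(d, n_experts)),
                        rng.normal(size=(n_experts, d, d)))
print(y.shape, chosen.shape)  # (4, 8) (4, 2)
```

Only `k` of the `n_experts` weight matrices touch each token, which is why capacity can grow faster than per-token compute.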
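The summary mentions a WeightConverter that packs expert weights into a single tensor at load time. A toy illustration of the packing idea follows; the checkpoint key pattern and `pack_experts` helper are hypothetical, not the library's actual API.

```python
import numpy as np

# Hypothetical per-expert checkpoint entries, stored as separate 2-D matrices.
state_dict = {f"experts.{i}.w": np.full((4, 4), float(i)) for i in range(8)}

def pack_experts(sd, n_experts):
    """Fuse per-expert matrices into one (n_experts, d_in, d_out) tensor,
    so every expert can be applied with a single batched matmul."""
    return np.stack([sd[f"experts.{i}.w"] for i in range(n_experts)])

packed = pack_experts(state_dict, 8)   # (8, 4, 4)

# Grouped execution: one batch of tokens per expert, one einsum call,
# instead of a Python loop over eight separate matmuls.
tokens = np.ones((8, 3, 4))            # (n_experts, tokens_per_expert, d_in)
out = np.einsum("etd,edo->eto", tokens, packed)
print(packed.shape, out.shape)         # (8, 4, 4) (8, 3, 4)
```

Fusing the weights this way trades many small kernel launches for one large grouped operation, which is the efficiency argument behind packing at load time.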
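Expert Parallelism places disjoint subsets of experts on different devices, so routed tokens must be shipped to whichever rank owns their expert. This is a single-process toy simulation of that dispatch step; the rank count, contiguous expert layout, and function names are assumptions, and no real communication happens.

```python
from collections import defaultdict

N_EXPERTS, N_RANKS = 8, 4
EXPERTS_PER_RANK = N_EXPERTS // N_RANKS  # contiguous layout: rank r owns experts [2r, 2r+1]

def owner(expert_id):
    """Rank that holds this expert's weights under the contiguous layout."""
    return expert_id // EXPERTS_PER_RANK

def dispatch(token_expert_ids):
    """Group token indices by the rank owning their routed expert,
    mimicking the all-to-all exchange used in expert parallelism."""
    buckets = defaultdict(list)
    for tok, e in enumerate(token_expert_ids):
        buckets[owner(e)].append(tok)
    return dict(buckets)

routed = [0, 5, 2, 7, 1, 6]  # expert chosen per token by the router
print(dispatch(routed))
```

In a real deployment each bucket becomes one message of an all-to-all collective; the summary's GroupedGemmParallel and RouterParallel components handle the compute and routing sides of this exchange.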