
Mixture of Experts (MoEs) in Transformers

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Aritra Roy Gosthipaty, Pedro Cuenca, merve, Ilyas Moutawwakil, Arthur Zucker, Sergio Paniego, and Pablo Montalvo
Word Count: 2,054
Language: -
Hacker News Points: -
Summary

Mixture of Experts (MoE) models in Transformers represent a significant shift from dense to sparse architectures, offering computational efficiency and scalability through learnable sub-networks, or "experts," that are selectively activated for different tokens. This allows model capacity to grow without a proportional increase in inference cost, as in gpt-oss-20b, which activates only a small subset of its experts per token.

Supporting MoEs in Transformers required significant changes to the model loading and execution pipeline: a dynamic weight loading system built around a WeightConverter packs expert weights into a single tensor for efficient runtime execution, and an Experts Backend system lets users select among execution strategies based on workload requirements.

Expert Parallelism lets these models scale beyond single-device constraints by distributing experts across multiple devices, using components such as GroupedGemmParallel and RouterParallel for efficient computation. Training MoEs, while complex due to their massive parameter counts and routing instabilities, is optimized through a collaboration with Unsloth that yields significant speed and memory-efficiency improvements. The Transformers library continues to evolve to support these sparse architectures, integrating new abstractions and backend optimizations to ease the development and deployment of advanced MoE models.
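The sparse activation the summary describes comes down to a learned router picking a few experts per token. The following is a minimal NumPy sketch of top-k routing, not the Transformers implementation; all names (`moe_forward`, `router_w`, `expert_ws`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    (n_tokens, d_model)
    router_w:  (d_model, n_experts) router projection
    expert_ws: (n_experts, d_model, d_model), one weight matrix per expert
    """
    logits = tokens @ router_w                  # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k highest-scoring experts
    # Renormalize the selected logits so the k gate weights sum to 1 per token.
    gates = softmax(np.take_along_axis(logits, topk, axis=-1))
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for slot in range(k):
            e = topk[t, slot]
            out[t] += gates[t, slot] * (tokens[t] @ expert_ws[e])
    return out, topk

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 4, 8, 6
y, chosen = moe_forward(rng.normal(size=(n_tokens, d)),
                        rng.normal(size=(d, n_experts)),
                        rng.normal(size=(n_experts, d, d)))
print(y.shape, chosen.shape)  # (4, 8) (4, 2)
```

Only `k` of the `n_experts` weight matrices touch each token, which is why capacity can grow faster than per-token compute.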
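The summary mentions a WeightConverter that packs expert weights into a single tensor at load time. A toy illustration of the packing idea follows; the checkpoint key pattern and `pack_experts` helper are hypothetical, not the library's actual API.

```python
import numpy as np

# Hypothetical per-expert checkpoint entries, stored as separate 2-D matrices.
state_dict = {f"experts.{i}.w": np.full((4, 4), float(i)) for i in range(8)}

def pack_experts(sd, n_experts):
    """Fuse per-expert matrices into one (n_experts, d_in, d_out) tensor,
    so every expert can be applied with a single batched matmul."""
    return np.stack([sd[f"experts.{i}.w"] for i in range(n_experts)])

packed = pack_experts(state_dict, 8)   # (8, 4, 4)

# Grouped execution: one batch of tokens per expert, one einsum call,
# instead of a Python loop over eight separate matmuls.
tokens = np.ones((8, 3, 4))            # (n_experts, tokens_per_expert, d_in)
out = np.einsum("etd,edo->eto", tokens, packed)
print(packed.shape, out.shape)         # (8, 4, 4) (8, 3, 4)
```

Fusing the weights this way trades many small kernel launches for one large grouped operation, which is the efficiency argument behind packing at load time.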
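Expert Parallelism places disjoint subsets of experts on different devices, so routed tokens must be shipped to whichever rank owns their expert. This is a single-process toy simulation of that dispatch step; the rank count, contiguous expert layout, and function names are assumptions, and no real communication happens.

```python
from collections import defaultdict

N_EXPERTS, N_RANKS = 8, 4
EXPERTS_PER_RANK = N_EXPERTS // N_RANKS  # contiguous layout: rank r owns experts [2r, 2r+1]

def owner(expert_id):
    """Rank that holds this expert's weights under the contiguous layout."""
    return expert_id // EXPERTS_PER_RANK

def dispatch(token_expert_ids):
    """Group token indices by the rank owning their routed expert,
    mimicking the all-to-all exchange used in expert parallelism."""
    buckets = defaultdict(list)
    for tok, e in enumerate(token_expert_ids):
        buckets[owner(e)].append(tok)
    return dict(buckets)

routed = [0, 5, 2, 7, 1, 6]  # expert chosen per token by the router
print(dispatch(routed))
```

In a real deployment each bucket becomes one message of an all-to-all collective; the summary's GroupedGemmParallel and RouterParallel components handle the compute and routing sides of this exchange.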