EMO: Pretraining mixture of experts for emergent modularity
Blog post from HuggingFace
EMO is a newly released mixture-of-experts (MoE) model designed to foster emergent modularity without relying on human-defined priors. Rather than operating as a monolithic system the way dense language models do, EMO activates only a small subset of its experts for a given task, retaining near full-model performance even when just 12.5% of its experts are engaged.

Standard MoEs tend to specialize their experts on low-level lexical patterns. EMO instead encourages experts to form coherent groups aligned with semantic domains: during pretraining, document boundaries serve as a supervisory signal so that tokens from the same document activate similar experts, promoting domain specialization.

On general-purpose benchmarks, EMO maintains performance even when restricted to reduced expert subsets, and its modular design supports flexible deployment with improved memory-accuracy trade-offs. Together, the architecture and training approach provide a foundation for modular language models that are easier to deploy, adapt, and interpret, and they open up further research into expert selection and composition.
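The post does not include code, but a minimal sketch can make the document-boundary signal concrete: tokens from the same document are pulled toward a shared routing distribution through an auxiliary loss added to the language-modeling objective. The function name, the KL-based loss form, and the centroid construction below are illustrative assumptions, not the released EMO implementation.

```python
import torch
import torch.nn.functional as F

def document_routing_consistency_loss(router_logits, doc_ids):
    """Illustrative auxiliary loss: push each token's expert distribution
    toward the average distribution of its document.

    router_logits: (num_tokens, num_experts) raw routing scores
    doc_ids:       (num_tokens,) integer document id per token (long dtype)
    """
    log_probs = router_logits.log_softmax(dim=-1)   # per-token expert distribution (log-space)
    probs = log_probs.exp()
    num_docs = int(doc_ids.max().item()) + 1
    num_experts = probs.size(-1)

    # Average routing distribution of each document (its "routing centroid").
    doc_sums = torch.zeros(num_docs, num_experts, device=probs.device, dtype=probs.dtype)
    doc_sums.index_add_(0, doc_ids, probs)
    doc_counts = torch.bincount(doc_ids, minlength=num_docs).clamp(min=1)
    doc_means = doc_sums / doc_counts.unsqueeze(-1).to(probs.dtype)

    # Pull each token toward its document centroid (centroid treated as a fixed target).
    targets = doc_means[doc_ids].detach()
    return F.kl_div(log_probs, targets, reduction="batchmean")
```

In such a setup the auxiliary term would simply be weighted and added to the usual language-modeling loss, e.g. `loss = lm_loss + lambda_route * document_routing_consistency_loss(router_logits, doc_ids)`, with `lambda_route` a hypothetical hyperparameter.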
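The memory-accuracy trade-off at deployment comes from loading only a fraction of the experts. A hypothetical sketch of how one might pick a domain-relevant subset and restrict routing to it is shown below; the usage-based ranking, the 12.5% keep fraction, and all function names are assumptions for illustration.

```python
import torch

def select_expert_subset(router_logits_sample, keep_fraction=0.125):
    """Rank experts by average routing mass on a small calibration sample
    from the target domain and keep the top fraction.

    router_logits_sample: (num_tokens, num_experts) routing scores on the sample
    """
    usage = router_logits_sample.softmax(dim=-1).mean(dim=0)   # average routing mass per expert
    num_keep = max(1, int(round(keep_fraction * usage.numel())))
    return usage.topk(num_keep).indices                        # expert indices to load

def masked_routing(router_logits, kept_experts):
    """Restrict token routing to the retained expert subset at inference time."""
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, kept_experts] = 0.0
    return (router_logits + mask).softmax(dim=-1)
```

Because EMO's experts group by semantic domain, a subset chosen this way can in principle cover a target domain with a fraction of the memory footprint, which is the trade-off the post highlights.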