
Create Mixtures of Experts with MergeKit

Blog post from HuggingFace

Post Details

Company: HuggingFace
Date Published: -
Author: Maxime Labonne
Word Count: 2,007
Language: -
Hacker News Points: -
Summary

The Mixture of Experts (MoE) architecture has grown increasingly popular thanks to innovations like Mixtral and to new ways of building MoEs with Arcee's MergeKit library. Traditional MoEs are pre-trained from scratch, whereas MergeKit creates "frankenMoEs" by combining several already pre-trained models, offering a cheaper path to strong performance. Like any MoE, these models route each input to a subset of specialized subnetworks, or "experts", which makes training faster and inference more efficient than a dense model of comparable total size. The approach has trade-offs, however: every expert must be kept in memory, so VRAM requirements are high, and fine-tuning frankenMoEs remains challenging. The article walks through building a frankenMoE with MergeKit, selecting experts for different tasks to produce Beyonder-4x7B-v3, a model that performs well on several benchmarks. Despite these trade-offs, frankenMoEs show promise for preserving knowledge and producing robust models, especially when sufficient hardware is available.
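As a rough illustration of the workflow the article describes, the sketch below writes a MergeKit MoE ("frankenMoE") configuration from Python and hands it to the `mergekit-moe` command. It is a minimal sketch, not the exact recipe behind Beyonder-4x7B-v3: the base model, expert names, positive prompts, and CLI flags shown here are placeholders or assumptions, and the config schema follows MergeKit's documented MoE format, which may differ across versions.

```python
# Minimal sketch of driving MergeKit's MoE mode from Python.
# Assumptions: mergekit is installed (pip install mergekit) and exposes the
# mergekit-moe command-line entry point. The expert models and prompts below
# are illustrative placeholders, not the experts used in Beyonder-4x7B-v3.
import subprocess
from pathlib import Path

config = """\
base_model: mistralai/Mistral-7B-Instruct-v0.2   # placeholder base model
gate_mode: hidden   # initialise router weights from hidden states of the prompts
experts:
  - source_model: placeholder/chat-expert-7B
    positive_prompts:
      - "chat casually about everyday topics"
  - source_model: placeholder/code-expert-7B
    positive_prompts:
      - "write a Python function that"
  - source_model: placeholder/math-expert-7B
    positive_prompts:
      - "solve the following math problem step by step"
  - source_model: placeholder/story-expert-7B
    positive_prompts:
      - "write a short story about"
"""

Path("moe-config.yaml").write_text(config)

# Assemble the frankenMoE; exact flags may vary between MergeKit versions.
subprocess.run(
    ["mergekit-moe", "moe-config.yaml", "./frankenmoe-4x7b", "--copy-tokenizer"],
    check=True,
)
```

Roughly speaking, the positive prompts are used to initialise each expert's routing weights, so the merged model's router learns to send chat-like inputs to the chat expert, code-like inputs to the code expert, and so on, activating only the relevant experts per token.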