Company
Together AI
Date Published
Author
Dan Fu, Simran Arora, Chris Ré
Word count
1981
Language
English
Hacker News points
None

Summary

Researchers at Together AI have developed a new model architecture called Monarch Mixer (M2), which aims to increase efficiency while maintaining the quality of Transformers. M2 is sub-quadratic in both sequence length and model dimension: it replaces the Transformer's attention and MLP blocks with operations built from structured Monarch matrices, allowing it to scale to longer sequences and larger models while training faster. The first target for M2 is BERT, one of the most widely used models for language tasks; M2-BERT matches BERT's quality while being 25% more parameter-efficient. The researchers have also explored long-sequence variants of Monarch Mixer, which could scale to much longer sequences without a significant loss in quality. The code and checkpoints for M2-BERT are now available on GitHub, and further releases and updates are planned in the coming weeks.
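
As a rough illustration of the structured matrices behind M2 (not the authors' released code, which is in the GitHub repository), below is a minimal PyTorch sketch of a Monarch matrix multiply. It assumes the square-grid Monarch factorization M = P·B2·P·B1, where B1 and B2 are block-diagonal and P is the permutation that transposes a √n × √n grid; the function name and shapes here are illustrative only.

```python
import torch


def monarch_matmul(x: torch.Tensor, blocks1: torch.Tensor, blocks2: torch.Tensor) -> torch.Tensor:
    """Multiply x by a Monarch matrix M = P @ B2 @ P @ B1 (sub-quadratic).

    x:       (..., n) input, with n = m * m
    blocks1: (m, m, m) -- m diagonal blocks of the first factor B1
    blocks2: (m, m, m) -- m diagonal blocks of the second factor B2
    """
    m = blocks1.shape[0]
    *batch, n = x.shape
    assert n == m * m, "input dim must equal block size squared"

    # B1: m independent (m x m) block multiplies
    x = x.reshape(*batch, m, m)
    x = torch.einsum("bij,...bj->...bi", blocks1, x)

    # P: permute by transposing the (m, m) grid
    x = x.transpose(-1, -2)

    # B2: second block-diagonal factor
    x = torch.einsum("bij,...bj->...bi", blocks2, x)

    # P again (the transpose permutation is its own inverse)
    x = x.transpose(-1, -2)
    return x.reshape(*batch, n)


# Usage: n = 64 = 8 * 8, where a dense layer would need a full 64 x 64 matmul
x = torch.randn(4, 64)                 # batch of 4 vectors
b1 = torch.randn(8, 8, 8) / 8 ** 0.5   # hypothetical learned block factors
b2 = torch.randn(8, 8, 8) / 8 ** 0.5
y = monarch_matmul(x, b1, b2)          # shape (4, 64)
```

For a dimension-n input, the two block-diagonal factors cost O(n√n) multiply-adds instead of the O(n²) of a dense layer, which is the kind of sub-quadratic scaling in model dimension (and, applied along the sequence axis, in sequence length) that the summary refers to.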