Mamba-3B-SlimPJ has emerged as a strong contender to Transformers: it scales linearly in sequence length, offers fast inference, and rivals some of the best 3B Transformer architectures. The model was trained on 600B tokens of the SlimPajama dataset and is released under the Apache 2.0 license. Training reused the hyperparameters of Mamba-3B on the Pile, but with a longer learning-rate decay schedule to accommodate the larger token budget. In evaluations, Mamba-3B-SlimPJ matches the quality of very strong Transformers such as BTLM-3B-8K while using 17% fewer training FLOPs. Mamba remains a promising architecture for building foundation models across diverse domains such as language, genomics, audio, and video, and the released base model is a good starting point for experimentation, analysis, and chat or instruction-tuned derivatives.
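To make the inference claim concrete, below is a minimal generation sketch using the `mamba_ssm` package's `MambaLMHeadModel` interface. The Hub checkpoint ID `state-spaces/mamba-2.8b-slimpj` and the use of the GPT-NeoX tokenizer are assumptions about how the weights are published, so adjust them to match the actual release.

```python
# Minimal sketch: greedy generation with the SlimPajama-trained Mamba checkpoint.
# Assumptions (verify against the actual release): the weights are available on the
# Hugging Face Hub as "state-spaces/mamba-2.8b-slimpj" and use the GPT-NeoX tokenizer.
import torch
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-2.8b-slimpj",  # assumed checkpoint ID
    device=device,
    dtype=dtype,
)

prompt = "Selective state-space models are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Decoding advances a fixed-size recurrent state rather than a growing KV cache,
# which is where Mamba's linear scaling and fast-inference properties come from.
out = model.generate(
    input_ids=input_ids,
    max_length=input_ids.shape[1] + 64,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The same loading pattern can serve as the base for downstream fine-tuning (e.g. chat or instruction tuning); only the generation call would be replaced by a standard training loop.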