Company
ServiceNow
Date Published
Author
Torsten Scholak, Oleksiy Ostapenko, Raymond Li, Luke Kumar, and Joel Lamy-Poirier
Word count
1709
Language
English
Hacker News points
None

Summary

In a bid to make their 15B reasoning model more efficient without compromising quality, the team at ServiceNow-AI built a hybrid model, Apriel-H1, by integrating Mamba layers. A key insight drove the work: distill the model on high-quality, task-specific data that preserves its reasoning capabilities, rather than relying on generic pretraining data. Using a staged distillation approach that progressively replaced attention layers with Mamba layers, they achieved up to 2.1x higher throughput with minimal quality loss. The effort culminated in the Apriel-H1-15b-Thinker-SFT model, which maintains reasoning quality across benchmarks. The Fast-LLM framework facilitated the development; its modular design makes attention and Mamba layers easy to swap. While the hybrid model delivers significant efficiency gains, deploying it in production still requires care because the surrounding tooling is maturing, and the team underscores the importance of matching distillation data to the specific capability being preserved.
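
Below is a minimal PyTorch sketch of the staged attention-to-Mamba swap the summary describes. Everything here is illustrative: the AttentionMixer and MambaMixer classes, the three-stage replacement schedule, and the layer indices are assumptions, not Fast-LLM's actual API or Apriel-H1's layer layout. A real hybrid would plug in a proper Mamba implementation (e.g., the mamba_ssm package) and run distillation against the frozen teacher between stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMixer(nn.Module):
    """Quadratic self-attention mixer, standing in for the teacher's layers."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class MambaMixer(nn.Module):
    """Toy linear-time mixer used as a placeholder for a real Mamba/SSM layer."""

    def __init__(self, d_model: int):
        super().__init__()
        d_inner = 2 * d_model
        self.proj_in = nn.Linear(d_model, d_inner)
        # Depthwise convolution over the sequence dimension.
        self.conv = nn.Conv1d(d_inner, d_inner, kernel_size=4, padding=3,
                              groups=d_inner)
        self.proj_out = nn.Linear(d_inner, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(x).transpose(1, 2)   # (B, d_inner, T)
        h = self.conv(h)[..., : x.size(1)]    # trim right padding: causal
        return self.proj_out(F.silu(h.transpose(1, 2)))


def build_teacher_stack(d_model: int = 512, n_layers: int = 12) -> nn.ModuleList:
    """Start from an all-attention stack, as in the teacher model."""
    return nn.ModuleList(AttentionMixer(d_model) for _ in range(n_layers))


if __name__ == "__main__":
    layers = build_teacher_stack()
    # Hypothetical staged schedule: each stage swaps a few more attention
    # layers for Mamba mixers, with distillation on reasoning-specific data
    # run between stages (omitted here).
    stages = [[2, 5, 8], [1, 4, 7, 10], [0, 3, 6, 9, 11]]
    for stage in stages:
        for i in stage:
            layers[i] = MambaMixer(512)
        # ... distill against the frozen teacher before the next stage ...

    x = torch.randn(1, 16, 512)
    for layer in layers:
        x = x + layer(x)  # residual connection around each mixer
    print(x.shape)  # torch.Size([1, 16, 512])
```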