Company
ServiceNow
Date Published
Author
Torsten Scholak, Oleksiy Ostapenko, Raymond Li, Luke Kumar, and Joel Lamy-Poirier
Word count
1709
Language
English
Hacker News points
None

Summary

In a bid to make their 15B reasoning model more efficient without compromising quality, the team at ServiceNow-AI built a hybrid model, Apriel-H1, by integrating Mamba layers. A key insight drove the work: distill the model on high-quality, task-specific data that preserves its reasoning capabilities, rather than relying on generic pretraining data. Using a staged distillation approach that progressively replaced attention layers with Mamba layers, they achieved up to 2.1x higher throughput with minimal quality loss. The effort culminated in the Apriel-H1-15b-Thinker-SFT model, which maintains reasoning quality across benchmarks. The Fast-LLM framework facilitated the development; its modular design makes attention and Mamba layers easy to swap. While the hybrid model delivers significant efficiency gains, deploying it in production still requires care because the surrounding tooling is maturing, and the team underscores the importance of matching distillation data to the specific capability being preserved.
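
Below is a minimal PyTorch sketch of the staged attention-to-Mamba swap the summary describes. Everything here is illustrative: the AttentionMixer and MambaMixer classes, the three-stage replacement schedule, and the layer indices are assumptions, not Fast-LLM's actual API or Apriel-H1's layer layout. A real hybrid would plug in a proper Mamba implementation (e.g., the mamba_ssm package) and run distillation against the frozen teacher between stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMixer(nn.Module):
    """Quadratic self-attention mixer, standing in for the teacher's layers."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class MambaMixer(nn.Module):
    """Toy linear-time mixer used as a placeholder for a real Mamba/SSM layer."""

    def __init__(self, d_model: int):
        super().__init__()
        d_inner = 2 * d_model
        self.proj_in = nn.Linear(d_model, d_inner)
        # Depthwise convolution over the sequence dimension.
        self.conv = nn.Conv1d(d_inner, d_inner, kernel_size=4, padding=3,
                              groups=d_inner)
        self.proj_out = nn.Linear(d_inner, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(x).transpose(1, 2)   # (B, d_inner, T)
        h = self.conv(h)[..., : x.size(1)]    # trim right padding: causal
        return self.proj_out(F.silu(h.transpose(1, 2)))


def build_teacher_stack(d_model: int = 512, n_layers: int = 12) -> nn.ModuleList:
    """Start from an all-attention stack, as in the teacher model."""
    return nn.ModuleList(AttentionMixer(d_model) for _ in range(n_layers))


if __name__ == "__main__":
    layers = build_teacher_stack()
    # Hypothetical staged schedule: each stage swaps a few more attention
    # layers for Mamba mixers, with distillation on reasoning-specific data
    # run between stages (omitted here).
    stages = [[2, 5, 8], [1, 4, 7, 10], [0, 3, 6, 9, 11]]
    for stage in stages:
        for i in stage:
            layers[i] = MambaMixer(512)
        # ... distill against the frozen teacher before the next stage ...

    x = torch.randn(1, 16, 512)
    for layer in layers:
        x = x + layer(x)  # residual connection around each mixer
    print(x.shape)  # torch.Size([1, 16, 512])
```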