Content Deep Dive

The Optimal Architecture for Small Language Models

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Asankhaya Sharma
Word Count: 2,348
Language: -
Hacker News Points: -
Summary

Research into small language models suggests that, at around 70 million parameters, architecture choice matters less than previously thought. Experiments on 19 model configurations across 12 architecture families showed that a hidden dimension of at least 512 is essential for good performance, and that a depth of 32 layers (the "Goldilocks depth") gives the best results at a fixed parameter count. Surprisingly, all architectures tested, from GPT-2 variants to newer designs such as LLaMA3 and Gemma3, achieved similar benchmark accuracy. Diffusion models, however, proved significantly faster, delivering 3.8 times the throughput of traditional autoregressive models along with notable gains in factuality. Dhara-70M, a diffusion model created from an autoregressive architecture using the efficient Warmup-Stable-Decay method, shows how these findings can be combined into a model that balances speed, factual accuracy, and computational efficiency.
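For readers unfamiliar with the Warmup-Stable-Decay (WSD) schedule mentioned above, the sketch below shows how such a learning-rate schedule is typically structured: a short linear warmup, a long constant ("stable") phase, and a brief final anneal. The peak rate, phase fractions, and decay shape here are illustrative assumptions, not values taken from the post or from Dhara-70M's training recipe.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
# All hyperparameter values below are assumptions for illustration only.

def wsd_lr(step, total_steps, peak_lr=3e-4,
           warmup_frac=0.05, decay_frac=0.1, min_lr=3e-5):
    """Return the learning rate at `step` of a WSD schedule."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: linear ramp from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: linear anneal from the peak down to a small floor.
    progress = (step - decay_start) / decay_steps
    return peak_lr - (peak_lr - min_lr) * progress


# Example: print the rate at a few points of a 10,000-step run.
for s in (0, 250, 5_000, 9_500, 9_999):
    print(s, round(wsd_lr(s, 10_000), 6))
```

Because the stable phase keeps the learning rate at its peak for most of training, a checkpoint taken before the decay phase can be branched and annealed separately, which is one reason WSD is attractive for adapting an existing autoregressive model.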