Content Deep Dive

The Optimal Architecture for Small Language Models

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Asankhaya Sharma
Word Count: 2,348
Language: -
Hacker News Points: -
Summary

Research into small language models suggests that, at around 70 million parameters, architecture choice matters less than previously thought. Experiments on 19 model configurations across 12 architecture families showed that a hidden dimension of at least 512 is essential for good performance, and that a depth of 32 layers (the "Goldilocks depth") gives the best results at a fixed parameter count. Surprisingly, all architectures tested, from GPT-2 variants to newer designs such as LLaMA3 and Gemma3, achieved similar benchmark accuracy. Diffusion models, however, proved significantly faster, delivering 3.8 times the throughput of traditional autoregressive models along with notable gains in factuality. Dhara-70M, a diffusion model created from an autoregressive architecture using the efficient Warmup-Stable-Decay method, shows how these findings can be combined into a model that balances speed, factual accuracy, and computational efficiency.
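For readers unfamiliar with the Warmup-Stable-Decay (WSD) schedule mentioned above, the sketch below shows how such a learning-rate schedule is typically structured: a short linear warmup, a long constant ("stable") phase, and a brief final anneal. The peak rate, phase fractions, and decay shape here are illustrative assumptions, not values taken from the post or from Dhara-70M's training recipe.

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule.
# All hyperparameter values below are assumptions for illustration only.

def wsd_lr(step, total_steps, peak_lr=3e-4,
           warmup_frac=0.05, decay_frac=0.1, min_lr=3e-5):
    """Return the learning rate at `step` of a WSD schedule."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: linear ramp from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate constant.
        return peak_lr
    # Decay: linear anneal from the peak down to a small floor.
    progress = (step - decay_start) / decay_steps
    return peak_lr - (peak_lr - min_lr) * progress


# Example: print the rate at a few points of a 10,000-step run.
for s in (0, 250, 5_000, 9_500, 9_999):
    print(s, round(wsd_lr(s, 10_000), 6))
```

Because the stable phase keeps the learning rate at its peak for most of training, a checkpoint taken before the decay phase can be branched and annealed separately, which is one reason WSD is attractive for adapting an existing autoregressive model.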