Pruning has emerged as a critical technique for reducing the size of large language models (LLMs) while maintaining their functionality. The article focuses on structured width pruning in models built on the Gated Linear Unit (GLU) architecture, such as LLaMA 3.2, where neurons are selectively removed from the MLP layers to shrink the model while keeping essential capabilities intact. The method described respects the GLU structure, achieving a significant size reduction without compromising performance on tasks like BoolQ, although challenges remain on tasks that require broad context understanding, such as Lambada. The approach assesses neuron importance and carefully adjusts layer dimensions, yielding a model that stays coherent and can be further optimized through techniques like knowledge distillation. The discussion of depth pruning and capability-recovery processes points to future research directions for improving the efficiency and performance of pruned models, making them easier to deploy without extensive infrastructure.
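
To make the GLU-aware width pruning concrete, here is a minimal sketch, not the article's actual code. It assumes a Hugging Face LLaMA-style MLP module with `gate_proj`, `up_proj`, and `down_proj` projections and uses a simple weight-norm importance score; the function name, the `keep_ratio` parameter, and the scoring rule are illustrative assumptions.

```python
# Minimal sketch of GLU-aware width pruning for one LLaMA-style MLP block.
# Assumptions: `mlp` exposes gate_proj / up_proj (hidden -> intermediate)
# and down_proj (intermediate -> hidden), as in Hugging Face's LlamaMLP,
# and neuron importance is approximated by weight norms.
import torch
import torch.nn as nn

def prune_glu_mlp(mlp: nn.Module, keep_ratio: float = 0.7) -> None:
    """Remove the least important intermediate neurons, keeping GLU pairs aligned."""
    gate_w = mlp.gate_proj.weight    # [intermediate, hidden]
    up_w = mlp.up_proj.weight        # [intermediate, hidden]
    down_w = mlp.down_proj.weight    # [hidden, intermediate]

    # Score each intermediate neuron by the combined norm of its gate and up rows.
    importance = gate_w.norm(dim=1) + up_w.norm(dim=1)
    k = int(importance.numel() * keep_ratio)
    keep = torch.topk(importance, k).indices.sort().values

    def _new_linear(weight: torch.Tensor) -> nn.Linear:
        out_f, in_f = weight.shape
        layer = nn.Linear(in_f, out_f, bias=False)
        layer.weight = nn.Parameter(weight.clone())
        return layer

    # Slice the same indices in all three projections so the GLU pairing
    # (gate[i] * up[i]) and the matching down_proj columns stay consistent.
    mlp.gate_proj = _new_linear(gate_w[keep])
    mlp.up_proj = _new_linear(up_w[keep])
    mlp.down_proj = _new_linear(down_w[:, keep])
    mlp.intermediate_size = k  # keep the module's bookkeeping in sync, if present
```

In practice, a routine like this would be applied to every decoder layer's MLP, after which the model config's `intermediate_size` would be updated so the pruned checkpoint loads correctly; the exact importance criterion and pruning ratio are the knobs the article's method tunes.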