Structured-Then-Unstructured Pruning (STUN) is a two-phase approach for compressing Mixture-of-Experts (MoE) models: it first applies structured pruning to remove redundant experts, and then applies unstructured pruning to the weights inside each surviving expert. This ordering addresses the inefficiency and high computational cost of pruning very large MoE models such as Snowflake's Arctic, which has 128 experts. STUN shrinks the model while preserving performance, reaching high sparsity with little or no loss in accuracy even on demanding benchmarks such as GSM8K, and it outperforms both structured-only and unstructured-only pruning baselines. A key ingredient is exploiting the behavioral similarity between experts, which makes expert-level pruning decisions cheap and scalable (see the sketch below). The paper highlights STUN's generalizability to other MoE families, as well as hardware acceleration for models with unstructured sparsity, as promising directions for further improving memory access and processing efficiency.
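
To make the two-phase idea concrete, here is a minimal PyTorch sketch, not the paper's implementation. It assumes each expert is a `torch.nn.Module` and that a small calibration batch is available; the 0.9 similarity threshold is an arbitrary illustration, and simple magnitude pruning stands in for whatever unstructured criterion one prefers (e.g., Wanda or SparseGPT, which the actual method could plug in).

```python
import torch
import torch.nn.functional as F

def prune_moe_stun(experts, calib_x, keep_ratio=0.5, weight_sparsity=0.5):
    """Two-phase pruning sketch: drop behaviorally redundant experts,
    then zero out low-magnitude weights inside each surviving expert."""
    # Phase 1 (structured): compare experts by the similarity of their
    # outputs on a shared calibration batch.
    with torch.no_grad():
        outs = torch.stack([e(calib_x).flatten() for e in experts])  # [E, N*d]
        sim = F.cosine_similarity(outs.unsqueeze(1), outs.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(-1.0)  # ignore self-similarity

    kept = []
    for idx in range(len(experts)):
        # Greedy rule: keep an expert only if it is not too similar to
        # any expert we have already kept (0.9 is an assumed threshold).
        if all(sim[idx, j] < 0.9 for j in kept):
            kept.append(idx)
    # Enforce the target expert count if the greedy pass kept too many.
    n_keep = max(1, int(len(experts) * keep_ratio))
    kept = kept[:n_keep]

    # Phase 2 (unstructured): magnitude-prune weights within kept experts.
    with torch.no_grad():
        for idx in kept:
            for p in experts[idx].parameters():
                k = int(p.numel() * weight_sparsity)
                if k == 0:
                    continue
                thresh = p.abs().flatten().kthvalue(k).values
                p.mul_((p.abs() > thresh).to(p.dtype))  # zero smallest weights
    return kept
```

In a real MoE layer the router would also need to be restricted to the kept expert indices, and the unstructured step would typically use a calibration-aware criterion rather than raw magnitudes; the sketch only illustrates why pruning experts first is cheap, since it needs just one forward pass per expert over the calibration batch.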