Muon vs MuonClip vs Muon+AdamW for Fine-Tuning
Blog post from HuggingFace
Muon is an optimizer for large language model (LLM) training that has moved beyond benchmarks into real-world practice, most notably in the large-scale Kimi K2 model. This blog post examines how well Muon works for fine-tuning by testing three variants: Muon Only, Muon+AdamW, and MuonClip.

In experiments on the Qwen3 4B model with 10k rows of data, the Muon+AdamW hybrid outperforms Muon Only, MuonClip, and traditional AdamW. The hybrid's advantage comes from its stability, particularly in handling spikes in the gradient norm. MuonClip's QK-Clip mechanism, while stabilizing for long pre-training runs, proves less effective at this small fine-tuning scale. Future experiments will scale the study to larger models and datasets to assess the long-term viability of these optimizers.
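To ground the comparison, here is a minimal sketch of Muon's core idea: accumulate momentum for each 2-D weight matrix, approximately orthogonalize it with a Newton-Schulz iteration, and apply the result as the update. The quintic coefficients (3.4445, -4.7750, 2.0315) follow the public Muon reference implementation; the function names and the single-matrix `muon_step` wrapper below are illustrative, not the API used in the experiments.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Push G's singular values toward ~1 via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    X = G.astype(np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work in the wide orientation
        X = X.T
    X = X / (np.linalg.norm(X) + eps)   # Frobenius-normalize so spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # one quintic iteration step
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative Muon update for a single 2-D weight matrix (sketch)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    # shape-dependent scaling keeps update magnitude comparable across matrices
    update *= max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    return param - lr * update, momentum
```

The Muon+AdamW hybrid in the post builds on this split: Muon-style orthogonalized updates for the 2-D weight matrices, with AdamW handling parameters Muon is not designed for (embeddings, norms, biases). The orthogonalization step is why gradient-norm spikes matter less: the update's singular values are squashed toward 1 regardless of the raw gradient's scale.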