Muon vs MuonClip vs Muon+AdamW for Fine-Tuning
Blog post from HuggingFace
Muon is an optimizer for large language model (LLM) training that has moved beyond benchmarks into real-world practice, most notably in the large-scale Kimi K2 model. This blog post examines how well Muon works for fine-tuning by testing three variants: Muon Only, Muon+AdamW, and MuonClip.

In experiments on the Qwen3 4B model with 10k rows of data, the Muon+AdamW hybrid outperforms Muon Only, MuonClip, and traditional AdamW. The hybrid's advantage comes from its stability, particularly in handling spikes in the gradient norm. MuonClip's QK-Clip mechanism, while stabilizing for long pre-training runs, proves less effective at this small fine-tuning scale. Future experiments will scale the study to larger models and datasets to assess the long-term viability of these optimizers.
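To ground the comparison, here is a minimal sketch of Muon's core idea: accumulate momentum for each 2-D weight matrix, approximately orthogonalize it with a Newton-Schulz iteration, and apply the result as the update. The quintic coefficients (3.4445, -4.7750, 2.0315) follow the public Muon reference implementation; the function names and the single-matrix `muon_step` wrapper below are illustrative, not the API used in the experiments.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Push G's singular values toward ~1 via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    X = G.astype(np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work in the wide orientation
        X = X.T
    X = X / (np.linalg.norm(X) + eps)   # Frobenius-normalize so spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # one quintic iteration step
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative Muon update for a single 2-D weight matrix (sketch)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    # shape-dependent scaling keeps update magnitude comparable across matrices
    update *= max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    return param - lr * update, momentum
```

The Muon+AdamW hybrid in the post builds on this split: Muon-style orthogonalized updates for the 2-D weight matrices, with AdamW handling parameters Muon is not designed for (embeddings, norms, biases). The orthogonalization step is why gradient-norm spikes matter less: the update's singular values are squashed toward 1 regardless of the raw gradient's scale.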