
Differential Transformer V2

Blog post from HuggingFace

Post Details

Company: HuggingFace
Author: Li Dong
Word Count: 3,136
Summary

The Differential Transformer V2 (DIFF V2) introduces several enhancements over its predecessor, DIFF V1, aimed at making large language model (LLM) decoding faster and training more stable. DIFF V2 increases the number of query heads while keeping the number of key-value heads fixed, which aligns head dimensions and removes the need for custom attention kernels, speeding up decoding and reducing memory usage. By removing the per-head RMSNorm, DIFF V2 avoids the gradient spikes and numerical instability that DIFF V1 exhibited, especially at large learning rates. It replaces the learned scalar lambda with a lambda projected per token and per head, giving finer control over the RMS of the context vector, eliminating attention sinks, and further stabilizing training. In experiments, DIFF V2 achieves lower language modeling loss and smaller activation outlier magnitudes than baseline Transformers. Because the differential operation itself saves parameters, those parameters can be reallocated elsewhere in the model, and the design shows promise for scalable, stable training of large-scale LLMs.
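To make the mechanism described above concrete, here is a minimal single-head NumPy sketch of the differential attention pattern: two query projections share one key-value head, their softmax attention maps are subtracted, and the subtraction weight lambda is projected from each token rather than learned as a global scalar. The function name, weight shapes, and the use of a sigmoid to keep lambda in (0, 1) are illustrative assumptions, not the post's exact implementation, which this summary does not reproduce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention_v2(x, Wq1, Wq2, Wk, Wv, Wlam):
    """Illustrative single-head differential attention.

    x:     (T, D) token representations
    Wq1/Wq2: (D, d) two query projections sharing one KV head
    Wk/Wv: (D, d) key and value projections
    Wlam:  (D, 1) hypothetical per-token lambda projection
    """
    q1, q2 = x @ Wq1, x @ Wq2
    k, v = x @ Wk, x @ Wv
    d = k.shape[-1]
    a1 = softmax(q1 @ k.T / np.sqrt(d))        # first attention map
    a2 = softmax(q2 @ k.T / np.sqrt(d))        # second attention map
    # Per-token lambda (broadcast over keys); V2 drops the per-head RMSNorm.
    lam = 1.0 / (1.0 + np.exp(-(x @ Wlam)))
    attn = a1 - lam * a2                        # differential attention
    return attn @ v
```

Note that each row of the differential map sums to `1 - lam` for that token, which is how this construction can suppress the common-mode attention mass (e.g. attention sinks) that both maps assign to the same positions.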