
Differential Transformer V2

Blog post from HuggingFace

Post Details

Company: HuggingFace
Author: Li Dong
Word Count: 3,136
Summary

The Differential Transformer V2 (DIFF V2) introduces several enhancements over its predecessor, DIFF V1, aimed at making large language model (LLM) decoding faster and training more stable. DIFF V2 increases the number of query heads while keeping the number of key-value heads fixed, which aligns head dimensions and removes the need for custom attention kernels, speeding up decoding and reducing memory usage. By removing the per-head RMSNorm, DIFF V2 avoids the gradient spikes and numerical instability that DIFF V1 exhibited, especially at large learning rates. It replaces the learned scalar lambda with a lambda projected per token and per head, giving finer control over the RMS of the context vector, eliminating attention sinks, and further stabilizing training. In experiments, DIFF V2 achieves lower language modeling loss and smaller activation outlier magnitudes than baseline Transformers. Because the differential operation itself saves parameters, those parameters can be reallocated elsewhere in the model, and the design shows promise for scalable, stable training of large-scale LLMs.
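To make the mechanism described above concrete, here is a minimal single-head NumPy sketch of the differential attention pattern: two query projections share one key-value head, their softmax attention maps are subtracted, and the subtraction weight lambda is projected from each token rather than learned as a global scalar. The function name, weight shapes, and the use of a sigmoid to keep lambda in (0, 1) are illustrative assumptions, not the post's exact implementation, which this summary does not reproduce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention_v2(x, Wq1, Wq2, Wk, Wv, Wlam):
    """Illustrative single-head differential attention.

    x:     (T, D) token representations
    Wq1/Wq2: (D, d) two query projections sharing one KV head
    Wk/Wv: (D, d) key and value projections
    Wlam:  (D, 1) hypothetical per-token lambda projection
    """
    q1, q2 = x @ Wq1, x @ Wq2
    k, v = x @ Wk, x @ Wv
    d = k.shape[-1]
    a1 = softmax(q1 @ k.T / np.sqrt(d))        # first attention map
    a2 = softmax(q2 @ k.T / np.sqrt(d))        # second attention map
    # Per-token lambda (broadcast over keys); V2 drops the per-head RMSNorm.
    lam = 1.0 / (1.0 + np.exp(-(x @ Wlam)))
    attn = a1 - lam * a2                        # differential attention
    return attn @ v
```

Note that each row of the differential map sums to `1 - lam` for that token, which is how this construction can suppress the common-mode attention mass (e.g. attention sinks) that both maps assign to the same positions.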