Content Deep Dive

Bugs in LLM Training - Gradient Accumulation Fix

Blog post from Unsloth

Post Details
Company: Unsloth
Date Published: -
Author: Daniel & Michael
Word Count: 2,399
Language: English
Hacker News Points: -
Summary

Unsloth has developed a fix for a pervasive gradient accumulation issue that affects training runs, pre-training, and fine-tuning of sequence models such as large language models (LLMs). The problem, first identified in 2021, causes gradient accumulation to produce higher loss values than full-batch training. The discrepancy arises because naive gradient accumulation normalizes each mini-batch's cross-entropy loss by that mini-batch's own (non-padded) token count rather than by the total token count across all accumulated mini-batches; when sequence lengths vary, the accumulated loss no longer matches the full-batch loss. Unsloth's fix rescales the loss contributions so they are normalized by the correct denominator, making gradient accumulation mathematically equivalent to full-batch training and reducing the remaining discrepancy to small floating-point accumulation error, with a significant reduction in the L2 norm of the difference. The fix is available through a simple update of the Unsloth package, and Unsloth has collaborated with Hugging Face and other frameworks to integrate the same correction into their training code.
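
The normalization mismatch described in the summary can be illustrated with a short, self-contained sketch. This is not Unsloth's or Hugging Face's actual implementation; the vocabulary size, accumulation steps, and per-mini-batch token counts below are illustrative assumptions, but the contrast between the naive and corrected denominators is the one the post describes.

```python
# Minimal sketch of the gradient-accumulation normalization mismatch
# (illustrative only; not Unsloth's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 32
token_counts = [7, 23, 5, 41]          # unpadded tokens per mini-batch (assumed)
accum_steps = len(token_counts)

logits = [torch.randn(n, vocab) for n in token_counts]
labels = [torch.randint(0, vocab, (n,)) for n in token_counts]

# Full-batch reference: a single mean over every token in the accumulated batch.
full_batch = F.cross_entropy(torch.cat(logits), torch.cat(labels), reduction="mean")

# Naive accumulation: average the per-mini-batch means. Each term is divided by
# its own mini-batch length, so the result drifts from the full-batch loss
# whenever the lengths differ.
naive = sum(F.cross_entropy(l, y, reduction="mean")
            for l, y in zip(logits, labels)) / accum_steps

# Corrected accumulation: sum the un-normalized losses and divide once by the
# total token count, which matches the full-batch denominator.
fixed = sum(F.cross_entropy(l, y, reduction="sum")
            for l, y in zip(logits, labels)) / sum(token_counts)

print(f"full batch: {full_batch.item():.6f}")
print(f"naive GA:   {naive.item():.6f}")   # noticeably different
print(f"fixed GA:   {fixed.item():.6f}")   # matches up to floating-point error
```

When all mini-batches have the same token count the two denominators coincide and the naive approach is harmless; the divergence, and the higher reported loss, appears only with variable-length (padded) sequences, which is the usual case when fine-tuning LLMs.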