Content Deep Dive

Bugs in LLM Training - Gradient Accumulation Fix

Blog post from Unsloth

Post Details
Company: Unsloth
Date Published: -
Author: Daniel & Michael
Word Count: 2,399
Language: English
Hacker News Points: -
Summary

Unsloth has developed a fix for a pervasive gradient accumulation issue that affects training runs, pre-training, and fine-tuning of sequence models such as large language models (LLMs). The problem, first identified in 2021, causes gradient accumulation to produce higher loss values than full-batch training. The discrepancy arises because naive gradient accumulation normalizes each mini-batch's cross-entropy loss by that mini-batch's own (non-padded) token count rather than by the total token count across all accumulated mini-batches; when sequence lengths vary, the accumulated loss no longer matches the full-batch loss. Unsloth's fix rescales the loss contributions so they are normalized by the correct denominator, making gradient accumulation mathematically equivalent to full-batch training and reducing the remaining discrepancy to small floating-point accumulation error, with a significant reduction in the L2 norm of the difference. The fix is available through a simple update of the Unsloth package, and Unsloth has collaborated with Hugging Face and other frameworks to integrate the same correction into their training code.
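
The normalization mismatch described in the summary can be illustrated with a short, self-contained sketch. This is not Unsloth's or Hugging Face's actual implementation; the vocabulary size, accumulation steps, and per-mini-batch token counts below are illustrative assumptions, but the contrast between the naive and corrected denominators is the one the post describes.

```python
# Minimal sketch of the gradient-accumulation normalization mismatch
# (illustrative only; not Unsloth's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 32
token_counts = [7, 23, 5, 41]          # unpadded tokens per mini-batch (assumed)
accum_steps = len(token_counts)

logits = [torch.randn(n, vocab) for n in token_counts]
labels = [torch.randint(0, vocab, (n,)) for n in token_counts]

# Full-batch reference: a single mean over every token in the accumulated batch.
full_batch = F.cross_entropy(torch.cat(logits), torch.cat(labels), reduction="mean")

# Naive accumulation: average the per-mini-batch means. Each term is divided by
# its own mini-batch length, so the result drifts from the full-batch loss
# whenever the lengths differ.
naive = sum(F.cross_entropy(l, y, reduction="mean")
            for l, y in zip(logits, labels)) / accum_steps

# Corrected accumulation: sum the un-normalized losses and divide once by the
# total token count, which matches the full-batch denominator.
fixed = sum(F.cross_entropy(l, y, reduction="sum")
            for l, y in zip(logits, labels)) / sum(token_counts)

print(f"full batch: {full_batch.item():.6f}")
print(f"naive GA:   {naive.item():.6f}")   # noticeably different
print(f"fixed GA:   {fixed.item():.6f}")   # matches up to floating-point error
```

When all mini-batches have the same token count the two denominators coincide and the naive approach is harmless; the divergence, and the higher reported loss, appears only with variable-length (padded) sequences, which is the usual case when fine-tuning LLMs.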