Company
Date Published
Author
Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Re, Ce Zhang
Word count
336
Language
English
Hacker News points
None

Summary

"Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees" introduces AC-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. AC-SGD compresses the changes of activations rather than the activation values themselves, and achieves an O(1/√T) convergence rate without assuming gradient unbiasedness. The algorithm can be optimized and implemented efficiently, providing up to a 4.3X end-to-end speed-up on slower networks without sacrificing model quality. When combined with state-of-the-art gradient compression algorithms, AC-SGD enables "end-to-end communication compression," yielding up to a 4.9X improvement. The technique offers a cost-effective approach (20% faster training) and applies to large-scale models (up to 1.5 billion parameters), making it suitable for a range of applications, including those that rely on high-quality datasets such as RedPajama-V2.
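The core idea, sending a quantized change (delta) of the activations relative to a cached previous value instead of the activations themselves, can be sketched roughly as below. This is an illustrative sketch only: the class, the per-sample cache keyed by `sample_id`, and the simple uniform quantizer are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of activation-delta compression between two pipeline stages.
# Assumption: both stages see the same sample ids across epochs, so deltas shrink
# as training converges and compress well even at low bit-widths.

import torch


def quantize_uniform(x: torch.Tensor, num_bits: int = 4):
    """Simple uniform quantizer (assumed stand-in for the paper's compressor)."""
    scale = x.abs().max().clamp(min=1e-8) / (2 ** (num_bits - 1) - 1)
    q = torch.round(x / scale).clamp(-(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
    return q, scale


def dequantize_uniform(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q * scale


class ActivationDeltaCompressor:
    """Keeps per-sample caches so only activation *changes* cross the slow link."""

    def __init__(self, num_bits: int = 4):
        self.num_bits = num_bits
        self.sender_cache = {}    # sample_id -> last reconstructed activation (sender side)
        self.receiver_cache = {}  # mirror of the cache on the receiving stage

    def compress(self, sample_id, activation: torch.Tensor):
        prev = self.sender_cache.get(sample_id, torch.zeros_like(activation))
        delta = activation - prev
        q, scale = quantize_uniform(delta, self.num_bits)
        # Track the *reconstructed* value so sender and receiver caches stay in sync.
        self.sender_cache[sample_id] = prev + dequantize_uniform(q, scale)
        return q, scale  # this is what actually gets sent over the network

    def decompress(self, sample_id, q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        prev = self.receiver_cache.get(sample_id, torch.zeros(q.shape))
        activation = prev + dequantize_uniform(q, scale)
        self.receiver_cache[sample_id] = activation
        return activation
```

In this sketch the first pass for a sample transmits a quantized version of the full activation (the cached value is zero); later passes transmit only the quantized difference, which is the property the summary's guarantees and speed-ups rest on.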