Fine-tuning language models over slow networks using activation compression with guarantees
AC-SGD is a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Instead of compressing the activation values directly, it compresses the changes of the activations across iterations, and it achieves an O(1/√T) convergence rate without assuming unbiased gradients. AC-SGD can be optimized and implemented efficiently, providing up to 4.3x end-to-end speed-up on slower networks without sacrificing model quality. When combined with state-of-the-art gradient compression algorithms, AC-SGD enables "end-to-end communication compression," yielding up to 4.9x speed-up. This makes training over cheaper, slower networks cost-effective (roughly 20% faster training) and applies to large-scale models with up to 1.5 billion parameters, making it suitable for a range of applications, including fine-tuning on high-quality datasets such as RedPajama-V2.
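To make the core idea concrete, below is a minimal sketch of delta-based activation compression between two pipeline stages. It is not the paper's exact algorithm: the class name `ActivationDeltaCompressor`, the simple uniform quantizer, and the 4-bit setting are illustrative assumptions. The key point it demonstrates is that both sender and receiver keep a synchronized reference copy of previously transmitted activations, and only the quantized change is communicated.

```python
import numpy as np

def quantize(x, num_bits=4):
    """Simple uniform quantization to num_bits (illustrative, not the paper's scheme)."""
    levels = 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Reconstruct an approximate tensor from the quantized payload."""
    return q.astype(np.float32) * scale + lo

class ActivationDeltaCompressor:
    """Sketch of the idea behind AC-SGD: compress the *change* of the activations
    across iterations rather than the activation values themselves."""

    def __init__(self, shape, num_bits=4):
        # Reference copy of previously transmitted activations; sender and
        # receiver each hold one and keep them in sync.
        self.reference = np.zeros(shape, dtype=np.float32)
        self.num_bits = num_bits

    def encode(self, activations):
        # Compress the delta relative to the reference, not the raw values.
        delta = activations - self.reference
        q, lo, scale = quantize(delta, self.num_bits)
        # Update the sender's reference with the *decoded* delta so it matches
        # exactly what the receiver will reconstruct.
        self.reference += dequantize(q, lo, scale)
        return q, lo, scale

    def decode(self, q, lo, scale):
        # Receiver applies the decoded delta to its own reference copy.
        self.reference += dequantize(q, lo, scale)
        return self.reference.copy()

# Usage sketch: one compressor per side of a pipeline-stage boundary.
shape = (8, 16)
sender, receiver = ActivationDeltaCompressor(shape), ActivationDeltaCompressor(shape)
activations = np.random.randn(*shape).astype(np.float32)
payload = sender.encode(activations)          # small quantized delta goes over the network
reconstructed = receiver.decode(*payload)     # receiver rebuilds the activations
print(np.abs(reconstructed - activations).max())
```

The intuition is that during fine-tuning the activations for a given microbatch change slowly from one pass to the next, so the deltas are small and tolerate aggressive quantization, which is what allows the convergence guarantee to hold without assuming unbiased gradients.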