Fine-tuning language models over slow networks using activation compression with guarantees
AC-SGD is a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Instead of compressing the activation values directly, it compresses the changes of the activations across iterations, and it achieves an O(1/√T) convergence rate without assuming unbiased gradients. AC-SGD can be optimized and implemented efficiently, providing up to 4.3x end-to-end speed-up on slower networks without sacrificing model quality. When combined with state-of-the-art gradient compression algorithms, AC-SGD enables "end-to-end communication compression," yielding up to 4.9x speed-up. This makes training over cheaper, slower networks cost-effective (roughly 20% faster training) and applies to large-scale models with up to 1.5 billion parameters, making it suitable for a range of applications, including fine-tuning on high-quality datasets such as RedPajama-V2.
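To make the core idea concrete, below is a minimal sketch of delta-based activation compression between two pipeline stages. It is not the paper's exact algorithm: the class name `ActivationDeltaCompressor`, the simple uniform quantizer, and the 4-bit setting are illustrative assumptions. The key point it demonstrates is that both sender and receiver keep a synchronized reference copy of previously transmitted activations, and only the quantized change is communicated.

```python
import numpy as np

def quantize(x, num_bits=4):
    """Simple uniform quantization to num_bits (illustrative, not the paper's scheme)."""
    levels = 2 ** num_bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Reconstruct an approximate tensor from the quantized payload."""
    return q.astype(np.float32) * scale + lo

class ActivationDeltaCompressor:
    """Sketch of the idea behind AC-SGD: compress the *change* of the activations
    across iterations rather than the activation values themselves."""

    def __init__(self, shape, num_bits=4):
        # Reference copy of previously transmitted activations; sender and
        # receiver each hold one and keep them in sync.
        self.reference = np.zeros(shape, dtype=np.float32)
        self.num_bits = num_bits

    def encode(self, activations):
        # Compress the delta relative to the reference, not the raw values.
        delta = activations - self.reference
        q, lo, scale = quantize(delta, self.num_bits)
        # Update the sender's reference with the *decoded* delta so it matches
        # exactly what the receiver will reconstruct.
        self.reference += dequantize(q, lo, scale)
        return q, lo, scale

    def decode(self, q, lo, scale):
        # Receiver applies the decoded delta to its own reference copy.
        self.reference += dequantize(q, lo, scale)
        return self.reference.copy()

# Usage sketch: one compressor per side of a pipeline-stage boundary.
shape = (8, 16)
sender, receiver = ActivationDeltaCompressor(shape), ActivationDeltaCompressor(shape)
activations = np.random.randn(*shape).astype(np.float32)
payload = sender.encode(activations)          # small quantized delta goes over the network
reconstructed = receiver.decode(*payload)     # receiver rebuilds the activations
print(np.abs(reconstructed - activations).max())
```

The intuition is that during fine-tuning the activations for a given microbatch change slowly from one pass to the next, so the deltas are small and tolerate aggressive quantization, which is what allows the convergence guarantee to hold without assuming unbiased gradients.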