Company:
Date Published:
Author: Together
Word count: 2211
Language: English
Hacker News points: None

Summary

At Together, researchers are working to bring the world's computation together in a decentralized cloud to accelerate AI research. Decentralized training of foundation models is challenging because it demands high network bandwidth between devices. To address this, two papers presented at NeurIPS 2022 focus on making decentralized training efficient over slow networks: the first tackles scheduling in heterogeneous environments, and the second addresses communication compression. By improving this efficiency, the researchers aim to significantly reduce the cost of training foundation models. In an empirical study, the authors found that even on a network 100x slower than typical data center interconnects, end-to-end training throughput for GPT-style models with 1.3B parameters is only 1.7-2.3x lower. These results show that scheduling and system optimizations can largely bridge the gap between decentralized and data center training. Beyond scheduling, however, an end-to-end system must also handle fault tolerance, network jitter, device heterogeneity, and communication compression.
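
To make the idea of communication compression concrete, here is a minimal Python sketch. It is a generic illustration, not the algorithm from either NeurIPS paper: gradients are quantized from float32 to int8 with a per-tensor scale before crossing a slow link, cutting the bytes transferred by roughly 4x at the cost of a small approximation error.

import numpy as np

def compress(grad):
    """Quantize a float32 gradient to int8 with a single per-tensor scale."""
    scale = max(float(np.abs(grad).max()), 1e-8) / 127.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale  # int8 payload is ~4x smaller than the float32 original

def decompress(q, scale):
    """Recover an approximate float32 gradient on the receiving side."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    grad = np.random.randn(1_000_000).astype(np.float32)  # stand-in for a real gradient
    q, scale = compress(grad)
    approx = decompress(q, scale)
    print(f"bytes on the wire: {q.nbytes} vs {grad.nbytes} uncompressed")
    print(f"max abs error: {np.abs(grad - approx).max():.5f}")

Production systems layer far more sophisticated schemes on top of this basic idea, but the underlying trade of bandwidth for a controlled approximation error is the same.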