Company:
Date Published:
Author: Together
Word count: 2211
Language: English
Hacker News points: None

Summary

At Together, researchers are working to bring the world's computation together in a decentralized cloud to accelerate AI research. Decentralized training of foundation models is challenging because it demands high network bandwidth between devices. To address this, two papers presented at NeurIPS 2022 focus on making decentralized training efficient over slow networks: the first tackles scheduling in heterogeneous environments, and the second addresses communication compression. By improving this efficiency, the researchers aim to significantly reduce the cost of training foundation models. In an empirical study, the authors found that even on a network 100x slower than typical data center interconnects, end-to-end training throughput for GPT-style models with 1.3B parameters is only 1.7-2.3x lower. These results show that scheduling and system optimizations can largely bridge the gap between decentralized and data center training. Beyond scheduling, however, an end-to-end system must also handle fault tolerance, network jitter, device heterogeneity, and communication compression.
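
To make the idea of communication compression concrete, here is a minimal Python sketch. It is a generic illustration, not the algorithm from either NeurIPS paper: gradients are quantized from float32 to int8 with a per-tensor scale before crossing a slow link, cutting the bytes transferred by roughly 4x at the cost of a small approximation error.

import numpy as np

def compress(grad):
    """Quantize a float32 gradient to int8 with a single per-tensor scale."""
    scale = max(float(np.abs(grad).max()), 1e-8) / 127.0
    q = np.clip(np.round(grad / scale), -127, 127).astype(np.int8)
    return q, scale  # int8 payload is ~4x smaller than the float32 original

def decompress(q, scale):
    """Recover an approximate float32 gradient on the receiving side."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    grad = np.random.randn(1_000_000).astype(np.float32)  # stand-in for a real gradient
    q, scale = compress(grad)
    approx = decompress(q, scale)
    print(f"bytes on the wire: {q.nbytes} vs {grad.nbytes} uncompressed")
    print(f"max abs error: {np.abs(grad - approx).max():.5f}")

Production systems layer far more sophisticated schemes on top of this basic idea, but the underlying trade of bandwidth for a controlled approximation error is the same.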