Company
Date Published
Author: Jiao Dong, Hao Zhang, Lianmin Zheng, Jun Gong, Jules S. Damji, Phi Nguyen
Word count: 2713
Language: English
Hacker News points: None

Summary

The text describes how two open-source frameworks, Alpa and Ray, integrate to train large language models (LLMs) such as OPT-175B with pipeline parallelism on up to 1024 A100 GPUs. Alpa automatically discovers and executes the best inter-operator and intra-operator parallelism plan for an LLM, while Ray is a unified framework for scaling AI and Python applications such as machine learning. Together, Alpa and Ray enable efficient training and inference of LLMs at scale, reduce scheduling frequency and overhead, and deliver strong performance and scalability results, including peak hardware FLOPs utilization of ~57.5% and ~179 TFLOPs per GPU.
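
As a rough illustration of the integration the summary describes, the sketch below shows Alpa's automatic parallelization being pointed at a Ray cluster via `alpa.init(cluster="ray")` and `@alpa.parallelize`. The toy model, optimizer step, and batch shapes are hypothetical placeholders, not taken from the article; it is a minimal sketch assuming a Ray cluster with GPUs is already running.

```python
# Minimal sketch (not from the article): Alpa driving parallel training on a Ray cluster.
# The "model", SGD update, and batch below are hypothetical placeholders.
import alpa
import jax
import jax.numpy as jnp

# Point Alpa at an existing Ray cluster; Ray manages the GPU workers
# that execute the parallel plan Alpa discovers.
alpa.init(cluster="ray")

@alpa.parallelize  # Alpa searches for inter-op / intra-op parallelism automatically
def train_step(params, batch):
    def loss_fn(p):
        preds = batch["x"] @ p["w"]  # toy linear "model"
        return jnp.mean((preds - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; a real LLM run would use an optimizer library.
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((512, 512))}
batch = {"x": jnp.ones((32, 512)), "y": jnp.zeros((32, 512))}
params = train_step(params, batch)
```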