Company
Date Published
Author
Anyscale Ray Team
Word count
1048
Language
English
Hacker News points
None

Summary

Airbnb's journey in integrating advanced ML technologies into its infrastructure highlights the importance of optimizing ML platforms to keep pace with rapidly evolving AI and ML technologies. The company initially faced gaps in Kubernetes' ability to support ML workloads and adopted Ray to close them, leveraging benefits such as easy local prototyping, remote execution with dynamic runtime control, and support for the latest ML frameworks like PyTorch and optimization libraries like DeepSpeed. To achieve cost efficiency, Airbnb built a fully elastic Ray cluster on AWS using auto-scaling groups, the Kubernetes cluster autoscaler, and KubeRay. They also enabled high-throughput networking across workers using AWS EFA and RDMA to train models of up to 12B parameters on 8x A100 GPUs, achieving 150 TFLOPS per A100 GPU in benchmarks. Looking ahead, they plan to investigate model parallelism for 30B+ parameter models, integrate Aviary for serving cost savings, and consolidate training engines. Airbnb's experience underscores the need to continually enhance an ML platform to stay competitive.
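
To make the Ray benefits mentioned above concrete, here is a minimal sketch (not from the original post) of local prototyping with remote execution and a dynamic runtime environment; the package list and function name are illustrative assumptions, not Airbnb's actual workload:

```python
import ray

# Dynamic runtime control: dependencies are declared per job instead of being
# baked into the cluster image (the pip packages here are only illustrative).
ray.init(runtime_env={"pip": ["torch"]})

@ray.remote(num_gpus=1)
def fine_tune(shard_id: int) -> str:
    """Hypothetical per-worker training task requesting one GPU."""
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # ... model training logic for this shard would go here ...
    return f"shard {shard_id} trained on {device}"

# Requesting more GPUs than are currently available is what lets an elastic
# setup (Ray autoscaler + KubeRay + the Kubernetes cluster autoscaler, as the
# summary describes) provision additional nodes on demand.
results = ray.get([fine_tune.remote(i) for i in range(8)])
print(results)
```

The same script runs unchanged on a laptop with `ray.init()` or against a KubeRay-managed cluster, which is the prototyping-to-production path the summary alludes to.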