Company
Date Published
Author
Anyscale Ray Team
Word count
1048
Language
English
Hacker News points
None

Summary

Airbnb's journey in integrating advanced ML technologies into its infrastructure highlights the importance of optimizing ML platforms to keep pace with rapidly evolving AI and ML technologies. The company initially faced gaps in Kubernetes' ability to support ML workloads and adopted Ray to close them, leveraging benefits such as easy local prototyping, remote execution with dynamic runtime control, and support for the latest ML frameworks like PyTorch and optimization libraries like DeepSpeed. To achieve cost efficiency, Airbnb built a fully elastic Ray cluster on AWS using auto-scaling groups, the Kubernetes cluster autoscaler, and KubeRay. They also enabled high-throughput networking across workers using AWS EFA and RDMA to train models of up to 12B parameters on 8x A100 GPUs, achieving 150 TFLOPS per A100 GPU in benchmarks. Looking ahead, they plan to investigate model parallelism for 30B+ parameter models, integrate Aviary for serving cost savings, and consolidate training engines. Airbnb's experience underscores the need to continually enhance an ML platform to stay competitive.
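
To make the Ray benefits mentioned above concrete, here is a minimal sketch (not from the original post) of local prototyping with remote execution and a dynamic runtime environment; the package list and function name are illustrative assumptions, not Airbnb's actual workload:

```python
import ray

# Dynamic runtime control: dependencies are declared per job instead of being
# baked into the cluster image (the pip packages here are only illustrative).
ray.init(runtime_env={"pip": ["torch"]})

@ray.remote(num_gpus=1)
def fine_tune(shard_id: int) -> str:
    """Hypothetical per-worker training task requesting one GPU."""
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # ... model training logic for this shard would go here ...
    return f"shard {shard_id} trained on {device}"

# Requesting more GPUs than are currently available is what lets an elastic
# setup (Ray autoscaler + KubeRay + the Kubernetes cluster autoscaler, as the
# summary describes) provision additional nodes on demand.
results = ray.get([fine_tune.remote(i) for i in range(8)])
print(results)
```

The same script runs unchanged on a laptop with `ray.init()` or against a KubeRay-managed cluster, which is the prototyping-to-production path the summary alludes to.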