Heterogeneous Training Cluster with Ray at Netflix

Company

Anyscale

Date Published

Oct. 20, 2023

Author

Anyscale Ray Team

Word count

902

Language

English

Hacker News points

None

URL

www.anyscale.com/blog/heterogeneous-training-cluster-with-ray-at-netflix

Summary

Netflix's machine learning platform relies on heterogeneous training clusters, using Ray and GPU clusters, to power its recommendation and content personalization systems. The platform trains a variety of ML models for recommendation and computer vision tasks, and it speeds up computation with custom operators, state-of-the-art operators, and optimized GPU communication. For data storage and management, Netflix uses local SSDs, S3 streaming, and FSx caching, and it offloads data loading to remote CPUs with Ray so that loading is decoupled from GPU training. Each team gets a durable heterogeneous cluster with autoscaling, and jobs specify only the number of GPUs they need. Data is stored in S3 and synced to FSx for high-speed access during training, while logs and checkpoints are written to EFS. Netflix is working on a centralized scheduler, exploring batch inference, and moving to fully scheduled job submission to maximize resource utilization and reduce contention between teams.