Company
Date Published
Author
Anyscale Ray Team
Word count
902
Language
English
Hacker News points
None

Summary

Netflix's machine learning platform relies on heterogeneous training clusters to power its recommendation and content personalization systems, using Ray and GPU clusters for model training, communication, and data management. The platform trains a range of ML models for recommendation and computer vision tasks, and speeds up computation with custom operators, state-of-the-art operators, and optimized GPU communication. For data storage and management, Netflix uses local SSDs, S3 streaming, and FSx caching, and it offloads data loading to remote CPUs with Ray to decouple it from GPU training. Each team gets a durable heterogeneous cluster with autoscaling, and jobs specify only the number of GPUs they need. Data is stored in S3 and synced to FSx for high-speed access during training, while logs and checkpoints are written to EFS. Netflix is working on a centralized scheduler, exploring batch inference, and moving to fully scheduled job submission to maximize resource utilization and reduce contention between teams.
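
The idea of decoupling data loading from GPU training can be illustrated with a small producer/consumer sketch. This is not Netflix's implementation and does not use Ray: a local thread pool stands in for the remote CPU workers, and `load_batch`, `train_step`, and `run` are hypothetical names invented for illustration. It only shows the shape of the pattern, in which loaders prepare batches ahead of time so the training loop never loads data itself.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def load_batch(batch_id: int) -> list[int]:
    """Hypothetical loader: stands in for reading and preprocessing one batch on a CPU worker."""
    return [batch_id * 10 + i for i in range(4)]


def train_step(batch: list[int]) -> int:
    """Hypothetical training step: stands in for a GPU forward/backward pass."""
    return sum(batch)


def run(num_batches: int, num_loaders: int = 2, prefetch: int = 4) -> list[int]:
    # Bounded hand-off queue between the loaders and the training loop.
    ready: queue.Queue = queue.Queue(maxsize=prefetch)

    def producer() -> None:
        with ThreadPoolExecutor(max_workers=num_loaders) as pool:
            # map() yields results in submission order, so batch order is preserved.
            for batch in pool.map(load_batch, range(num_batches)):
                ready.put(batch)
        ready.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()

    results = []
    # The training loop only consumes ready batches; it never loads data itself.
    while (batch := ready.get()) is not None:
        results.append(train_step(batch))
    return results
```

In a Ray-based setup, the thread pool's role would presumably be played by CPU-only remote tasks or actors scheduled on separate nodes, with the GPU worker consuming their results.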