
How ByteDance Scales Offline Inference with Multi-Modal LLMs to 200 TB Data

Blog post from Anyscale

Post Details
Company: Anyscale
Date Published:
Author: Amog Kamsetty, Hao Chen, Liguang Xie
Word Count: 1,872
Language: English
Hacker News Points: 7
Summary

We leverage multi-modal models to power applications such as text-based image retrieval and object detection at ByteDance, with large-model offline inference over roughly 200 TB of data as a key workload. To handle this scale, we use Ray, specifically Ray Data, as our computing framework; it provides the flexibility and scalability needed for large-scale model-parallel inference. Because the models are too large to fit in a single GPU's memory, we employ pipeline sharding, splitting each model across multiple GPU devices. Ray Data's streaming execution paradigm and elastic resource scheduling let us build efficient, scalable offline inference applications for large models, and we use KubeRay to deploy and manage the underlying Ray clusters.
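To make the pipeline-sharding idea concrete, here is a minimal, plain-Python sketch (no Ray, no GPUs): the model is split into sequential stages, each standing in for a shard pinned to one device, and micro-batches are streamed through the stages so no single device has to hold the whole model. The stage names and the toy arithmetic are illustrative assumptions, not ByteDance's actual model or setup.

```python
# Conceptual sketch of pipeline sharding: a "model" split into stages,
# each stage a stand-in for one shard on one GPU. Micro-batches stream
# through the stages in order, mirroring pipelined model-parallel inference.

def make_stage(name, scale):
    """Build a stand-in for one model shard assigned to one device."""
    def stage(batch):
        # A real stage would run a forward pass on its assigned GPU;
        # here we just scale each element to keep the sketch runnable.
        return [x * scale for x in batch]
    stage.name = name  # hypothetical label, e.g. which layers live on which GPU
    return stage

# The "model" as a pipeline of shards, e.g. one per GPU (illustrative split).
pipeline = [
    make_stage("gpu0: embedding + layers 0-15", 2),
    make_stage("gpu1: layers 16-31 + head", 3),
]

def infer(batches, pipeline):
    """Stream micro-batches through every stage in order, yielding outputs."""
    for batch in batches:
        for stage in pipeline:
            batch = stage(batch)
        yield batch

micro_batches = [[1, 2], [3, 4]]
results = list(infer(micro_batches, pipeline))
print(results)  # each input scaled by 2 then 3 -> [[6, 12], [18, 24]]
```

In a real deployment, Ray Data plays the role of `infer` here: it streams batches between stages and schedules each shard onto its own GPU, so throughput is limited by the slowest stage rather than by total model size.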