
Go big or go OOM: the art of scaling vLLM

Blog post from AI21 Labs

Post Details
Company: AI21 Labs
Date Published:
Author: Ella Neiman, Engineering Team Lead
Word Count: 2,315
Language: English
Summary

This post describes efforts to optimize LLM-as-a-Judge (JLM) deployments serving multiple concurrent training jobs, with the twin goals of reducing GPU underutilization and keeping the system stable under variable load. The challenge was addressed with a two-pronged strategy: optimizing single-node performance and scaling the deployment across multiple nodes. Single-node optimization centered on tuning the vLLM configuration with automated tools such as Auto-Tune vLLM, which yielded significant throughput and latency improvements by adjusting parameters like maximum sequence length, burst patterns, and tensor-parallelism degree. For multi-node deployment, a horizontal scaling strategy adjusted capacity dynamically based on queue-size metrics, letting the system absorb traffic spikes while maintaining performance. These techniques generalize to high-throughput inference deployments well beyond the specific case of JLM serving for GRPO training.
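As a rough sketch of what single-node tuning of this kind involves: vLLM exposes the relevant knobs as server flags, which automated sweeps like the one described would vary. The model name and values below are illustrative placeholders, not the configuration from the post:

```shell
# Hypothetical starting point for a tuning sweep — every value here is
# a placeholder to be swept, not a recommendation from the post.
vllm serve my-judge-model \
  --tensor-parallel-size 2 \       # GPUs the model is sharded across
  --max-model-len 8192 \           # maximum sequence length served
  --max-num-seqs 256 \             # concurrent sequences per batch
  --gpu-memory-utilization 0.90    # fraction of VRAM given to the KV cache
```

Sweeping `--max-num-seqs` and `--max-model-len` trades per-request latency against batch throughput, which is why automated tuning tends to beat hand-picked defaults.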