Company
Date Published
Author
Lev Neiman
Word count
2513
Language
English
Hacker News points
None

Summary

Galileo developed the Luna-2 small language models to provide real-time AI evaluations with millisecond latency, so that applications can be checked for safety without degrading performance. Traditional load balancers could not account for the widely varying execution times of GPU inference requests, so the team switched to a client-side, load-aware balancer backed by Redis, in which each client picks the least busy GPU and work is spread evenly. The system relies on Redis's atomic operations and Lua scripting to maintain an accurate, real-time view of GPU load, which supports efficient request routing and failure handling. The change improved average GPU utilization by approximately 40% and reduced tail latency by 70%, with the largest latency reductions at larger input sizes. It also demonstrated that client-side load balancing over a fast shared-state store such as Redis can improve GPU inference performance without complex infrastructure changes.
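
The "pick the least busy GPU" step can be illustrated with a minimal sketch. It is not Galileo's implementation: the summary does not give the data layout, so the sorted-set design, the Lua script, and every name here (`ACQUIRE_LUA`, `InMemoryLoadTable`, `run_on_least_busy`, the `gpu-a`/`gpu-b` identifiers) are assumptions for illustration. To keep the example runnable without a Redis server, an in-memory table guarded by a lock stands in for the shared state; the Lua string shows what an equivalent atomic script might look like if the in-flight counts were kept as scores in a Redis sorted set.

```python
import threading

# Hypothetical Lua script (assumed design, not from the article): atomically
# pick the member with the lowest score (fewest in-flight requests) from the
# sorted set at KEYS[1] and increment its score. Run via Redis EVAL, the whole
# script executes as one atomic step, so two clients cannot claim the same slot.
ACQUIRE_LUA = """
local picked = redis.call('ZRANGE', KEYS[1], 0, 0)
if #picked == 0 then return false end
redis.call('ZINCRBY', KEYS[1], 1, picked[1])
return picked[1]
"""


class InMemoryLoadTable:
    """Local stand-in for the shared state; a lock mimics script atomicity."""

    def __init__(self, gpu_ids):
        self._lock = threading.Lock()
        self._in_flight = {gpu: 0 for gpu in gpu_ids}

    def acquire(self):
        """Return the least busy GPU and count one more in-flight request."""
        with self._lock:
            if not self._in_flight:
                return None
            # Ties break by name, matching sorted-set ordering for equal scores.
            gpu = min(self._in_flight, key=lambda g: (self._in_flight[g], g))
            self._in_flight[gpu] += 1
            return gpu

    def release(self, gpu):
        """Count one fewer in-flight request, never dropping below zero."""
        with self._lock:
            if self._in_flight.get(gpu, 0) > 0:
                self._in_flight[gpu] -= 1

    def remove(self, gpu):
        """Drop a GPU from rotation, e.g. after it is detected as failed."""
        with self._lock:
            self._in_flight.pop(gpu, None)

    def snapshot(self):
        with self._lock:
            return dict(self._in_flight)


def run_on_least_busy(table, infer):
    """Route one request: acquire a GPU, run `infer(gpu)`, always release."""
    gpu = table.acquire()
    if gpu is None:
        raise RuntimeError("no GPUs registered")
    try:
        return infer(gpu)
    finally:
        # Releasing in `finally` keeps the counts accurate if inference fails.
        table.release(gpu)
```

A client would call `run_on_least_busy(table, send_request)` for each request, where `send_request` is whatever function sends the work to the chosen GPU. The key design point is that selection and increment happen as one atomic step, so concurrent clients cannot all pile onto the same "idle" GPU.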