How We Boosted GPU Utilization by 40% with Redis & Lua

Post Details

Company

Galileo

Date Published

Nov. 24, 2025

Author

Lev Neiman

Word Count

2,513

Language

English

Hacker News Points

-

Source URL

galileo.ai/blog/how-we-boosted-gpu-utilization-by-40-with-redis-lua

Summary

Galileo developed the Luna-2 small language models to provide real-time AI evaluations with millisecond latency, so that applications stay safe without a performance penalty. To raise GPU utilization and cut latency, the team implemented a client-side, load-aware balancer backed by Redis, which improved average GPU utilization by approximately 40% and reduced tail latency by 70%. Traditional load balancers could not accommodate the widely varying execution times of GPU inference workloads, so Galileo moved to client-side load balancing, in which each client picks the least busy GPU and work spreads evenly across GPUs. The system uses Redis's atomic operations and Lua scripting to maintain an accurate, real-time view of GPU load, which supports efficient request routing and failure handling. The change produced significant latency reductions, especially for larger inputs, and showed that client-side load balancing over a fast shared-state store such as Redis can improve GPU inference performance without complex infrastructure changes.
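
The "pick the least busy GPU, atomically" idea in the summary can be sketched roughly as follows. This is a minimal illustration, not Galileo's implementation. It assumes the in-flight request counts live in a single Redis sorted set (the key name `gpu:inflight` is hypothetical), that `client` is a redis-py client, and that the names `acquire_gpu`, `release_gpu`, and `LocalLoadTable` are invented for this example. The in-memory class mirrors the same pick-and-increment behavior so the logic can be tried without a Redis server.

```python
from typing import Dict, Iterable, Optional

# Assumed layout: one Redis sorted set whose members are GPU worker IDs and
# whose scores are in-flight request counts. The key name is hypothetical.
LOAD_KEY = "gpu:inflight"

# Runs atomically inside Redis: read the lowest-scored member and bump its
# score in one step, so two clients cannot both claim the same "idle" GPU.
ACQUIRE_LUA = """
local picked = redis.call('ZRANGE', KEYS[1], 0, 0)
if #picked == 0 then
  return false
end
redis.call('ZINCRBY', KEYS[1], 1, picked[1])
return picked[1]
"""


def acquire_gpu(client, key: str = LOAD_KEY) -> Optional[str]:
    """Ask Redis for the least busy GPU (client is assumed to be redis-py)."""
    picked = client.eval(ACQUIRE_LUA, 1, key)
    if picked is None:
        return None  # no GPUs registered in the sorted set
    return picked.decode() if isinstance(picked, bytes) else picked


def release_gpu(client, gpu_id: str, key: str = LOAD_KEY) -> None:
    """Decrement the in-flight count once a request finishes or fails."""
    client.zincrby(key, -1, gpu_id)


class LocalLoadTable:
    """In-memory stand-in with the same semantics, for trying the logic offline."""

    def __init__(self, gpu_ids: Iterable[str]) -> None:
        self.inflight: Dict[str, int] = {gpu: 0 for gpu in gpu_ids}

    def acquire(self) -> Optional[str]:
        if not self.inflight:
            return None
        # Lowest count wins; ties break by name, as equal scores do in a sorted set.
        gpu = min(self.inflight, key=lambda g: (self.inflight[g], g))
        self.inflight[gpu] += 1
        return gpu

    def release(self, gpu_id: str) -> None:
        if gpu_id in self.inflight:
            self.inflight[gpu_id] -= 1


if __name__ == "__main__":
    table = LocalLoadTable(["gpu-a", "gpu-b", "gpu-c"])
    first = table.acquire()   # every GPU is idle, so the tie breaks by name
    second = table.acquire()  # the first GPU is now busier, so another is picked
    table.release(first)      # the first request finishes
    print(first, second, table.inflight)
```

Doing the read and the increment in one script is the point of the design: if a client read the counts and incremented in two separate round trips, several clients could observe the same idle GPU and all send work to it.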