The LLM Inference Trilemma: Throughput, Latency, Cost
Blog post from DigitalOcean
Balaji Varadarajan examines the challenge of scaling Large Language Model (LLM) inference, which involves a three-way trade-off among throughput, latency, and cost, known as the "trilemma." Unlike traditional web services, which can be scaled by simply adding servers, LLM inference is a stateful process constrained by factors such as memory bandwidth and hardware interconnects. The article breaks the cost of LLM inference down into capital, operational, opportunity, and engineering costs, and covers strategies for reducing them through model architecture, quantization, and parallelism. It stresses that the workload type determines how to balance latency-sensitive against throughput-sensitive tasks, and that techniques like autoscaling and priority queuing can tailor system performance to specific business needs. The piece ultimately advocates a workload-aware approach: rigorous benchmarking and system tuning to navigate the trilemma, rather than reliance on standard configurations or superficial benchmarks.
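The trade-off described above can be made concrete with a toy roofline estimate. The sketch below is not taken from the article; it is a minimal illustration under stated assumptions. The `Gpu` figures, the 7B-parameter model, and the helper names `decode_step_seconds` and `trilemma_point` are all hypothetical, and the model ignores KV-cache traffic, interconnect overhead, and prefill. It assumes each decode step streams the model weights once (memory-bound) and spends roughly 2 FLOPs per parameter per generated token (compute-bound), and takes the slower of the two.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    # Illustrative assumptions only, not measurements of any real accelerator.
    mem_bandwidth_gbps: float  # memory bandwidth in GB/s
    compute_tflops: float      # usable compute in TFLOP/s
    hourly_cost_usd: float     # rental price per hour


def decode_step_seconds(gpu, params_billion, bytes_per_param, batch_size):
    """Toy estimate of the time for one decode step over the whole batch."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Weights are read once per step, regardless of batch size.
    memory_time = weight_bytes / (gpu.mem_bandwidth_gbps * 1e9)
    # Assumed ~2 FLOPs per parameter per generated token.
    flops = 2 * params_billion * 1e9 * batch_size
    compute_time = flops / (gpu.compute_tflops * 1e12)
    # The step takes as long as the slower of the two resources.
    return max(memory_time, compute_time)


def trilemma_point(gpu, params_billion, bytes_per_param, batch_size):
    """Return (tokens/s, per-token latency in ms, USD per million tokens)."""
    step = decode_step_seconds(gpu, params_billion, bytes_per_param, batch_size)
    throughput = batch_size / step      # tokens per second across the batch
    latency_ms = step * 1e3             # time each request waits per token
    cost_per_million = gpu.hourly_cost_usd / 3600 / throughput * 1e6
    return throughput, latency_ms, cost_per_million


if __name__ == "__main__":
    gpu = Gpu(mem_bandwidth_gbps=2000, compute_tflops=400, hourly_cost_usd=3.0)
    for batch in (1, 64, 512):
        tput, lat, cost = trilemma_point(gpu, 7, 2, batch)
        print(f"batch={batch:4d}  {tput:10.1f} tok/s  {lat:6.2f} ms/token  ${cost:.4f}/M tokens")
```

In this simplified model, growing the batch raises throughput and lowers cost per token at no latency penalty while the step is memory-bound, and starts to raise per-token latency once compute becomes the bottleneck. Lowering `bytes_per_param` (a stand-in for quantization) shortens the memory-bound step. Real systems should be benchmarked rather than estimated this way.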