LLM Inference Benchmarking - Measure What Matters
Blog post from DigitalOcean
LLM inference benchmarking is the process of measuring and tuning hardware and software together to reach the best balance of performance and cost. Effective benchmarking highlights the need for co-design between hardware components such as GPUs and the software layers above them, especially given the variability across GPU vendors such as NVIDIA and AMD. Inference itself divides into two phases: prefill, which is compute-bound, and decode, which is memory-bound; each calls for different optimization strategies.

Several metrics are essential for evaluating performance: Time to First Token (TTFT), Time per Output Token (TPOT), Inter-Token Latency (ITL), End-to-End Latency (E2EL), token throughput (TPS), and request throughput (RPS).

The Pareto frontier serves as a guide for balancing latency, throughput, and concurrency, helping AI teams tune their systems for specific workloads while tracking cost per token and total cost of ownership (TCO). Continuous benchmarking and micro-benchmarking remain crucial for identifying bottlenecks and pushing performance limits, with hardware-software alignment guiding the trade-offs between quantization, accuracy, and unit economics.
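The latency metrics listed above can be computed directly from per-request token timestamps. The sketch below is illustrative: the function name and the timestamp-list input format are assumptions for this example, not an API from the post.

```python
from statistics import mean

def latency_metrics(request_start, token_times):
    """Compute per-request latency metrics from token arrival timestamps.

    request_start: wall-clock time (seconds) the request was issued.
    token_times:   sorted arrival times (seconds) of each output token.
    """
    ttft = token_times[0] - request_start        # Time to First Token
    e2el = token_times[-1] - request_start       # End-to-End Latency
    # Inter-Token Latency: gap between consecutive output tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    n = len(token_times)
    # Time per Output Token: decode time averaged over tokens after the first
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    tps = n / e2el                               # token throughput for this request
    return {
        "TTFT": ttft,
        "TPOT": tpot,
        "ITL": mean(itl) if itl else 0.0,
        "E2EL": e2el,
        "TPS": tps,
    }
```

For example, a request issued at t=0 whose tokens arrive at 1.0 s, 2.0 s, and 3.0 s has a TTFT of 1.0 s, a TPOT of 1.0 s, and a throughput of 1 token/s.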