Company:
Date Published:
Author: Varun Shenoy, Philip Kiely
Word count: 3038
Language: English
Hacker News points: 113

Summary

We want to use the full power of our GPU during LLM inference, but to do that we first need to determine whether our inference is compute bound or memory bound. Calculating the operations-per-byte ratio a given GPU can sustain and comparing it to the arithmetic intensity of our model's attention layers reveals where the bottleneck lies, and that diagnosis tells us which optimizations will actually help (a sketch of this check follows below).

Batching is a practical strategy for optimizing memory-bound inference: it increases the model's arithmetic intensity by doing more computation for the same number of loads and stores from memory, since the weights are loaded once and reused across every sequence in the batch.

Evaluating GPUs for LLM inference requires weighing factors like latency sensitivity, batch size, and communication costs to choose the best GPU for our use case. Understanding the math behind profiling transformer inference is essential to controlling costs and improving performance during model serving, and real-world benchmarks help account for factors that theoretical calculations overlook.
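
As a concrete illustration of that check, here is a minimal Python sketch. The hardware numbers are assumed NVIDIA A100 80GB SXM datasheet values, and the cost model (2 * B * d^2 FLOPs for a batched matmul against FP16 weight and activation traffic) is a simplification for demonstration, not the article's full derivation:

    # Illustrative datasheet numbers for an NVIDIA A100 80GB SXM GPU;
    # substitute the specs of the GPU you are evaluating.
    PEAK_FLOPS = 312e12          # peak FP16 tensor compute: 312 TFLOPS
    MEMORY_BANDWIDTH = 2.039e12  # HBM bandwidth: 2,039 GB/s

    # Operations-per-byte ratio of the GPU: a kernel with lower arithmetic
    # intensity than this is memory bound; higher, compute bound.
    ops_per_byte = PEAK_FLOPS / MEMORY_BANDWIDTH  # ~153 FLOPs per byte


    def matmul_arithmetic_intensity(batch_size: int, d_model: int) -> float:
        """Arithmetic intensity (FLOPs per byte) of multiplying a batch of
        token vectors by a d_model x d_model FP16 weight matrix.

        The weights are loaded once and reused across the whole batch, so
        intensity grows roughly linearly with batch size.
        """
        flops = 2 * batch_size * d_model**2  # multiply-accumulate ops
        bytes_moved = 2 * d_model**2 + 4 * batch_size * d_model  # weights + in/out activations
        return flops / bytes_moved


    for batch in (1, 8, 64, 256):
        intensity = matmul_arithmetic_intensity(batch, d_model=4096)
        bound = "compute" if intensity > ops_per_byte else "memory"
        print(f"batch={batch:>3}: ~{intensity:6.1f} FLOPs/byte -> {bound} bound")

At small batch sizes the intensity works out to roughly the batch size itself, far below the GPU's ~153 ops:byte ratio, which is why single-request decoding is memory bound and why batching raises utilization until the workload eventually becomes compute bound.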