Company:
Date Published:
Author: Varun Shenoy, Philip Kiely
Word count: 3038
Language: English
Hacker News points: 113

Summary

We want to use the full power of our GPU during LLM inference, but to do that we first need to determine whether our inference is compute bound or memory bound. Calculating the operations-per-byte ratio a given GPU can sustain and comparing it to the arithmetic intensity of our model's attention layers reveals where the bottleneck lies, and that diagnosis tells us which optimizations will actually help (a sketch of this check follows below).

Batching is a practical strategy for optimizing memory-bound inference: it increases the model's arithmetic intensity by doing more computation for the same number of loads and stores from memory, since the weights are loaded once and reused across every sequence in the batch.

Evaluating GPUs for LLM inference requires weighing factors like latency sensitivity, batch size, and communication costs to choose the best GPU for our use case. Understanding the math behind profiling transformer inference is essential to controlling costs and improving performance during model serving, and real-world benchmarks help account for factors that theoretical calculations overlook.
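
As a concrete illustration of that check, here is a minimal Python sketch. The hardware numbers are assumed NVIDIA A100 80GB SXM datasheet values, and the cost model (2 * B * d^2 FLOPs for a batched matmul against FP16 weight and activation traffic) is a simplification for demonstration, not the article's full derivation:

    # Illustrative datasheet numbers for an NVIDIA A100 80GB SXM GPU;
    # substitute the specs of the GPU you are evaluating.
    PEAK_FLOPS = 312e12          # peak FP16 tensor compute: 312 TFLOPS
    MEMORY_BANDWIDTH = 2.039e12  # HBM bandwidth: 2,039 GB/s

    # Operations-per-byte ratio of the GPU: a kernel with lower arithmetic
    # intensity than this is memory bound; higher, compute bound.
    ops_per_byte = PEAK_FLOPS / MEMORY_BANDWIDTH  # ~153 FLOPs per byte


    def matmul_arithmetic_intensity(batch_size: int, d_model: int) -> float:
        """Arithmetic intensity (FLOPs per byte) of multiplying a batch of
        token vectors by a d_model x d_model FP16 weight matrix.

        The weights are loaded once and reused across the whole batch, so
        intensity grows roughly linearly with batch size.
        """
        flops = 2 * batch_size * d_model**2  # multiply-accumulate ops
        bytes_moved = 2 * d_model**2 + 4 * batch_size * d_model  # weights + in/out activations
        return flops / bytes_moved


    for batch in (1, 8, 64, 256):
        intensity = matmul_arithmetic_intensity(batch, d_model=4096)
        bound = "compute" if intensity > ops_per_byte else "memory"
        print(f"batch={batch:>3}: ~{intensity:6.1f} FLOPs/byte -> {bound} bound")

At small batch sizes the intensity works out to roughly the batch size itself, far below the GPU's ~153 ops:byte ratio, which is why single-request decoding is memory bound and why batching raises utilization until the workload eventually becomes compute bound.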