A useful rule of thumb for large language models is roughly 2 GB of GPU memory per 1 billion parameters when the weights are loaded in "half precision" (16-bit, i.e., 2 bytes per parameter). For a 70B model that works out to about 140 GB, so a single A100 80 GB GPU is not enough, but two A100s could suffice. Quantization reduces the GPU memory needed by lowering the precision of the model's weights; common levels are 16-bit (half precision), 8-bit, and 4-bit. A simple estimate of the memory required to load a quantized model is M = (P × (Q/8)) × 1.2, where M is the memory in GB, P is the number of parameters in billions, Q is the number of bits used to load the model, and the factor of 1.2 adds roughly 20% overhead for tasks such as key-value caching.
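As a rough illustration of that formula, here is a minimal sketch in Python; the function name and the example model sizes are chosen for illustration only, and the results are estimates rather than exact requirements.

```python
def estimate_gpu_memory_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate GPU memory in GB to load a model: M = (P * (Q/8)) * 1.2."""
    bytes_per_param = bits / 8  # e.g. 16-bit -> 2 bytes per parameter
    return params_billions * bytes_per_param * overhead

# Example: a 70B-parameter model at common quantization levels.
for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{estimate_gpu_memory_gb(70, bits):.0f} GB")
# 16-bit: ~168 GB, 8-bit: ~84 GB, 4-bit: ~42 GB
```

At 16-bit the estimate (about 168 GB with overhead) still exceeds two A100 80 GB GPUs' combined headroom for anything beyond the weights, which is why 8-bit or 4-bit quantization is often used to fit large models on fewer GPUs.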