Company:
Date Published:
Author: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Word count: 317
Language: English
Hacker News points: None

Summary

FlexGen is a high-throughput generative inference engine for running large language models (LLMs) on limited resources, such as a single commodity GPU. It can be flexibly configured under various hardware constraints by aggregating memory and computation from the GPU, CPU, and disk, and it solves a linear programming problem to search for efficient patterns for storing and accessing tensors. FlexGen also compresses both the weights and the attention cache to 4 bits with negligible accuracy loss, enabling larger batch sizes and higher throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen outperforms state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. The code is available online, allowing users to explore it for their own applications.
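
To make the placement idea concrete, here is a minimal sketch of how offloading could be framed as a linear program: choose what fraction of the model weights lives on GPU, CPU, and disk so that per-step access cost is minimized subject to memory budgets. All sizes, costs, and the objective below are illustrative placeholders, not FlexGen's actual cost model (which also covers activations and the attention cache).

```python
# Toy placement LP: split weights across GPU/CPU/disk.
# Numbers are made-up assumptions for illustration only.
from scipy.optimize import linprog

weight_gb = 30.0             # hypothetical total weight size
gpu_gb, cpu_gb = 16.0, 64.0  # hypothetical memory budgets

# Per-fraction access cost (seconds per step if 100% of the weights
# lived on that device); GPU-resident weights cost nothing to fetch.
cost = [0.0, 0.5, 5.0]       # [gpu, cpu, disk]

res = linprog(
    c=cost,
    A_ub=[[weight_gb, 0, 0],   # GPU share must fit in GPU memory
          [0, weight_gb, 0]],  # CPU share must fit in CPU memory
    b_ub=[gpu_gb, cpu_gb],
    A_eq=[[1, 1, 1]],          # the three shares cover all weights
    b_eq=[1.0],
    bounds=[(0, 1)] * 3,
    method="highs",
)
gpu_frac, cpu_frac, disk_frac = res.x
print(f"GPU {gpu_frac:.0%}, CPU {cpu_frac:.0%}, disk {disk_frac:.0%}")
```

With these placeholder numbers the solver fills the GPU first, spills the rest to CPU RAM, and touches disk only when both budgets are exhausted, which is the intuition behind cost-driven offloading.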
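Similarly, the 4-bit compression can be pictured as group-wise min/max quantization, sketched below with NumPy. The group size, helper names, and the uint8 storage (real 4-bit packing would fit two values per byte) are assumptions for illustration, not FlexGen's exact scheme.

```python
# Illustrative group-wise 4-bit quantization sketch, not FlexGen's code.
import numpy as np

def quantize_4bit(x, group_size=64):
    """Asymmetric 4-bit quantization with a scale/offset per group."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)  # 4 bits -> 16 levels
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

w = np.random.randn(1024, 64).astype(np.float32)
q, scale, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, lo, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Storing weights and the attention cache this way cuts their memory footprint roughly 4x versus fp16, which is what frees room for the larger effective batch sizes mentioned above.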