Company:
Date Published:
Author: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Word count: 317
Language: English
Hacker News points: None

Summary

FlexGen is a high-throughput generative inference engine for running large language models (LLMs) on limited resources, such as a single commodity GPU. It can be flexibly configured under various hardware constraints by aggregating memory and computation from the GPU, CPU, and disk, and it solves a linear programming problem to search for efficient patterns for storing and accessing tensors. FlexGen also compresses both the weights and the attention cache to 4 bits with negligible accuracy loss, enabling larger batch sizes and higher throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen outperforms state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. The code is available online, allowing users to explore it for their own applications.
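
To make the placement idea concrete, here is a minimal sketch of how offloading could be framed as a linear program: choose what fraction of the model weights lives on GPU, CPU, and disk so that per-step access cost is minimized subject to memory budgets. All sizes, costs, and the objective below are illustrative placeholders, not FlexGen's actual cost model (which also covers activations and the attention cache).

```python
# Toy placement LP: split weights across GPU/CPU/disk.
# Numbers are made-up assumptions for illustration only.
from scipy.optimize import linprog

weight_gb = 30.0             # hypothetical total weight size
gpu_gb, cpu_gb = 16.0, 64.0  # hypothetical memory budgets

# Per-fraction access cost (seconds per step if 100% of the weights
# lived on that device); GPU-resident weights cost nothing to fetch.
cost = [0.0, 0.5, 5.0]       # [gpu, cpu, disk]

res = linprog(
    c=cost,
    A_ub=[[weight_gb, 0, 0],   # GPU share must fit in GPU memory
          [0, weight_gb, 0]],  # CPU share must fit in CPU memory
    b_ub=[gpu_gb, cpu_gb],
    A_eq=[[1, 1, 1]],          # the three shares cover all weights
    b_eq=[1.0],
    bounds=[(0, 1)] * 3,
    method="highs",
)
gpu_frac, cpu_frac, disk_frac = res.x
print(f"GPU {gpu_frac:.0%}, CPU {cpu_frac:.0%}, disk {disk_frac:.0%}")
```

With these placeholder numbers the solver fills the GPU first, spills the rest to CPU RAM, and touches disk only when both budgets are exhausted, which is the intuition behind cost-driven offloading.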
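Similarly, the 4-bit compression can be pictured as group-wise min/max quantization, sketched below with NumPy. The group size, helper names, and the uint8 storage (real 4-bit packing would fit two values per byte) are assumptions for illustration, not FlexGen's exact scheme.

```python
# Illustrative group-wise 4-bit quantization sketch, not FlexGen's code.
import numpy as np

def quantize_4bit(x, group_size=64):
    """Asymmetric 4-bit quantization with a scale/offset per group."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)  # 4 bits -> 16 levels
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

w = np.random.randn(1024, 64).astype(np.float32)
q, scale, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, lo, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Storing weights and the attention cache this way cuts their memory footprint roughly 4x versus fp16, which is what frees room for the larger effective batch sizes mentioned above.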