
FlexGen: High-throughput generative inference of large language models with a single GPU

Blog post from Together AI

Post Details

Company: Together AI
Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Word Count: 317
Language: English
Summary

FlexGen is a high-throughput generative inference engine for large language models (LLMs) designed to run with limited resources, such as a single commodity GPU. It can be configured for a wide range of hardware constraints by aggregating memory and computation from the GPU, CPU, and disk, and it solves a linear programming problem to find efficient patterns for storing and accessing tensors. FlexGen also compresses both the weights and the attention (KV) cache to 4 bits with negligible accuracy loss, enabling larger batch sizes and higher throughput. Together, these techniques let FlexGen outperform state-of-the-art offloading systems when running OPT-175B on a single 16GB GPU, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. The code is available online, allowing users to explore it for a variety of applications.
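
To make the linear-programming idea concrete, here is a minimal sketch of a placement LP in the spirit of FlexGen's policy search. The tensor sizes, bandwidths, and the simplified per-token I/O cost model are illustrative assumptions, not FlexGen's actual cost model (which also covers activations, compute time, and the overlap of I/O with compute):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical sizes (GB) and bandwidths (GB/s); a real system would
# measure these by profiling the model and hardware.
W, C = 30.0, 18.0                    # total weight and KV-cache sizes
gpu_mem, cpu_mem = 16.0, 64.0        # device memory capacities
bw_cpu_gpu, bw_disk_gpu = 12.0, 1.0  # transfer bandwidths to the GPU

# Variables: [w_g, w_c, w_d, c_g, c_c, c_d] = fractions of the weights
# (w) and KV cache (c) placed on GPU (g), CPU (c), and disk (d).
# Objective: minimize streaming time = bytes moved / bandwidth.
# Data resident on the GPU costs nothing; CPU/disk data must be fetched.
cost = np.array([0, W / bw_cpu_gpu, W / bw_disk_gpu,
                 0, C / bw_cpu_gpu, C / bw_disk_gpu])

# Equality constraints: the fractions of each tensor group sum to 1.
A_eq = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]])
b_eq = np.array([1.0, 1.0])

# Inequality constraints: placed bytes must fit in GPU and CPU memory.
A_ub = np.array([[W, 0, 0, C, 0, 0],    # GPU capacity
                 [0, W, 0, 0, C, 0]])   # CPU capacity
b_ub = np.array([gpu_mem, cpu_mem])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 6)
w_g, w_c, w_d, c_g, c_c, c_d = res.x
print(f"weights: {w_g:.0%} GPU / {w_c:.0%} CPU / {w_d:.0%} disk")
print(f"cache:   {c_g:.0%} GPU / {c_c:.0%} CPU / {c_d:.0%} disk")
```

Because the problem is a small LP, it can be re-solved cheaply whenever the hardware budget or model size changes, which is what makes the configuration flexible.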
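
The 4-bit compression is based on fine-grained group-wise quantization. The sketch below shows the general idea in NumPy; the group size, helper names, and storage format are illustrative (a production implementation would pack two 4-bit values per byte instead of keeping them in `uint8`):

```python
import numpy as np

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Asymmetric 4-bit group-wise quantization: each group of
    `group_size` contiguous values gets its own min and scale, so an
    outlier only degrades precision within its local group."""
    groups = x.reshape(-1, group_size).astype(np.float32)
    mn = groups.min(axis=1, keepdims=True)
    mx = groups.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0            # 4 bits -> 16 quantization levels
    scale[scale == 0] = 1.0             # guard against constant groups
    q = np.clip(np.round((groups - mn) / scale), 0, 15).astype(np.uint8)
    return q, mn, scale

def dequantize_4bit(q, mn, scale, shape):
    return (q.astype(np.float32) * scale + mn).reshape(shape)

x = np.random.randn(4, 128).astype(np.float32)
q, mn, scale = quantize_4bit(x)
x_hat = dequantize_4bit(q, mn, scale, x.shape)
print("max abs error:", np.abs(x - x_hat).max())
```

Shrinking the weights and KV cache to roughly a quarter of their fp16 size is what frees enough memory to run the large effective batch sizes that drive FlexGen's throughput gains.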