Author
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
Word count
616
Language
English

Summary

We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that accelerates large language model inference on consumer GPUs with offloading, as well as on high-end GPUs, without any approximations. By building large trees of speculated tokens, Sequoia achieves significant speedups, reaching an average time between tokens (TBT) of 0.57s on a single RTX-4090 and outperforming highly optimized offloading systems such as DeepSpeed-ZeRO-Inference. On high-end GPUs, Sequoia improves decoding speed by up to 4.04x for larger models, making it suitable for a wide range of model sizes and hardware configurations. The framework combines a dynamic programming algorithm for constructing token trees (scalability), a verification method based on sampling without replacement (robustness), and a hardware-aware optimizer that selects the best tree size and depth for each hardware configuration. With Sequoia, users can host large language models of up to 70B parameters on consumer GPUs without approximations, enabling new AI-generated-content applications, while also achieving significant speedups on high-end GPUs in the small-batch setting.
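
To make the tree construction and hardware-aware selection concrete, here is a minimal, self-contained Python sketch. It is not Sequoia's implementation: in place of the paper's dynamic program it uses a greedy best-first expansion, which is optimal under the simplifying assumption that the k-th-ranked speculated child of any node is accepted independently with probability p_k (non-increasing in k). The depth constraint is omitted for brevity, and the acceptance probabilities and forward-pass latencies are made-up illustrative numbers.

```python
import heapq
import itertools


def build_tree(accept_probs, num_nodes):
    """Grow a speculation tree of `num_nodes` draft tokens that maximizes the
    expected number of tokens accepted per verification step, under a toy
    model where the k-th-ranked child of any node is accepted independently
    with probability accept_probs[k] (non-increasing in k).

    Returns the expected tokens generated per target-model forward pass
    (the "+1" accounts for the token the target model emits regardless).
    """
    tiebreak = itertools.count()
    # Heap entries: (-path_acceptance_prob, tiebreak, child_rank).
    # Seed with the root's rank-0 child.
    heap = [(-accept_probs[0], next(tiebreak), 0)]
    expected_tokens = 1.0
    for _ in range(num_nodes):
        neg_prob, _, rank = heapq.heappop(heap)
        path_prob = -neg_prob
        expected_tokens += path_prob  # linearity of expectation over nodes
        # Adding this node makes two new candidates available:
        # its own first child, and the next-ranked sibling under its parent.
        heapq.heappush(heap, (-path_prob * accept_probs[0], next(tiebreak), 0))
        if rank + 1 < len(accept_probs):
            parent_prob = path_prob / accept_probs[rank]
            heapq.heappush(
                heap,
                (-parent_prob * accept_probs[rank + 1], next(tiebreak), rank + 1),
            )
    return expected_tokens


def pick_tree_size(accept_probs, forward_latency_s):
    """Toy hardware-aware selection: among profiled tree sizes, pick the one
    maximizing throughput = expected accepted tokens / verification latency."""
    return max(
        forward_latency_s,
        key=lambda n: build_tree(accept_probs, n) / forward_latency_s[n],
    )


if __name__ == "__main__":
    # Made-up acceptance probabilities for the top-5 draft ranks, and made-up
    # forward latencies (seconds) for a hypothetical offloading setup, where
    # latency barely grows with tree size because weight transfer dominates.
    p = [0.8, 0.5, 0.3, 0.2, 0.1]
    latency = {16: 1.00, 32: 1.02, 64: 1.05, 128: 1.12, 256: 1.30}
    for n, t in latency.items():
        print(f"tree size {n:3d}: {build_tree(p, n):5.2f} expected tokens, "
              f"{build_tree(p, n) / t:4.2f} tokens/s")
    print("best size:", pick_tree_size(p, latency))
```

The toy optimizer mirrors the idea in the post: in offloading settings, verification latency grows slowly with tree size (weight transfer dominates), so large trees pay off; on high-end GPUs the latency curve is steeper, and a profiled optimizer of this kind would pick smaller, shallower trees.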