Author
Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
Word count
616
Language
English

Summary

We introduce Sequoia, a scalable, robust, and hardware-aware speculative decoding framework that accelerates large language model inference on consumer GPUs with offloading, as well as on high-end GPUs, without any approximations. By building large trees of speculated tokens, Sequoia achieves significant speedups, reaching an average time between tokens (TBT) of 0.57s on a single RTX-4090 and outperforming highly optimized offloading systems such as DeepSpeed-ZeRO-Inference. On high-end GPUs, Sequoia improves decoding speed by up to 4.04x for larger models, making it suitable for a wide range of model sizes and hardware configurations. The framework combines a dynamic programming algorithm for constructing token trees (scalability), a verification method based on sampling without replacement (robustness), and a hardware-aware optimizer that selects the best tree size and depth for each hardware configuration. With Sequoia, users can host large language models of up to 70B parameters on consumer GPUs without approximations, enabling new AI-generated-content applications, while also achieving significant speedups on high-end GPUs in the small-batch setting.
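
To make the tree construction and hardware-aware selection concrete, here is a minimal, self-contained Python sketch. It is not Sequoia's implementation: in place of the paper's dynamic program it uses a greedy best-first expansion, which is optimal under the simplifying assumption that the k-th-ranked speculated child of any node is accepted independently with probability p_k (non-increasing in k). The depth constraint is omitted for brevity, and the acceptance probabilities and forward-pass latencies are made-up illustrative numbers.

```python
import heapq
import itertools


def build_tree(accept_probs, num_nodes):
    """Grow a speculation tree of `num_nodes` draft tokens that maximizes the
    expected number of tokens accepted per verification step, under a toy
    model where the k-th-ranked child of any node is accepted independently
    with probability accept_probs[k] (non-increasing in k).

    Returns the expected tokens generated per target-model forward pass
    (the "+1" accounts for the token the target model emits regardless).
    """
    tiebreak = itertools.count()
    # Heap entries: (-path_acceptance_prob, tiebreak, child_rank).
    # Seed with the root's rank-0 child.
    heap = [(-accept_probs[0], next(tiebreak), 0)]
    expected_tokens = 1.0
    for _ in range(num_nodes):
        neg_prob, _, rank = heapq.heappop(heap)
        path_prob = -neg_prob
        expected_tokens += path_prob  # linearity of expectation over nodes
        # Adding this node makes two new candidates available:
        # its own first child, and the next-ranked sibling under its parent.
        heapq.heappush(heap, (-path_prob * accept_probs[0], next(tiebreak), 0))
        if rank + 1 < len(accept_probs):
            parent_prob = path_prob / accept_probs[rank]
            heapq.heappush(
                heap,
                (-parent_prob * accept_probs[rank + 1], next(tiebreak), rank + 1),
            )
    return expected_tokens


def pick_tree_size(accept_probs, forward_latency_s):
    """Toy hardware-aware selection: among profiled tree sizes, pick the one
    maximizing throughput = expected accepted tokens / verification latency."""
    return max(
        forward_latency_s,
        key=lambda n: build_tree(accept_probs, n) / forward_latency_s[n],
    )


if __name__ == "__main__":
    # Made-up acceptance probabilities for the top-5 draft ranks, and made-up
    # forward latencies (seconds) for a hypothetical offloading setup, where
    # latency barely grows with tree size because weight transfer dominates.
    p = [0.8, 0.5, 0.3, 0.2, 0.1]
    latency = {16: 1.00, 32: 1.02, 64: 1.05, 128: 1.12, 256: 1.30}
    for n, t in latency.items():
        print(f"tree size {n:3d}: {build_tree(p, n):5.2f} expected tokens, "
              f"{build_tree(p, n) / t:4.2f} tokens/s")
    print("best size:", pick_tree_size(p, latency))
```

The toy optimizer mirrors the idea in the post: in offloading settings, verification latency grows slowly with tree size (weight transfer dominates), so large trees pay off; on high-end GPUs the latency curve is steeper, and a profiled optimizer of this kind would pick smaller, shallower trees.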