Mixtral 8x7B is an LLM that produces results comparable to larger models like Llama 2 70B and GPT-3.5, but with fewer parameters and faster inference. Using TensorRT-LLM and quantizing the model to int8, we can serve Mixtral performantly on a single A100 GPU. Mixtral's mixture-of-experts architecture activates only 12.9B of its parameters per token during inference, so individual requests run faster than they would on a comparably capable dense model. Batched inference benefits less from this architecture, however: different tokens in a batch route to different experts, so more of the model's weights become active and throughput scales less favorably than the 12.9B active-parameter figure suggests. Quantizing the model to int8 cuts inference cost in half while preserving quality, with only a minimal increase in perplexity. Together, TensorRT-LLM and quantization unlock fast single-request and batched inference, making Mixtral suitable for a wide range of use cases.
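To make the routing argument concrete, here is a minimal sketch of top-2 expert routing in PyTorch. It is an illustrative toy, not Mixtral's implementation: the class name `TopTwoMoELayer` and the dimensions are invented for brevity, and Mixtral's real experts are gated feed-forward blocks inside a full transformer. The point it demonstrates is that a single token passes through only two of the eight expert MLPs, while the tokens of a large batch collectively route through all of them.

```python
# Toy top-2 mixture-of-experts layer (illustrative sketch, not Mixtral's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoELayer(nn.Module):
    """Feed-forward layer with 8 experts; each token uses only the top 2."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.gate(x)                           # (n_tokens, n_experts)
        top_w, top_i = torch.topk(logits, k=2, dim=-1)  # pick 2 experts per token
        top_w = F.softmax(top_w, dim=-1)                # normalize their weights
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopTwoMoELayer()
one_token = layer(torch.randn(1, 64))  # touches at most 2 of the 8 expert MLPs
a_batch = layer(torch.randn(64, 64))   # likely routes through all 8 experts
```

Running the single-token and batched calls side by side shows the asymmetry: per-request latency reflects only the two active experts, but a batch keeps nearly every expert's weights in play, which is the root of the reduced batching benefit described above.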