The article discusses the benefits of quantizing large language models (LLMs) such as Mistral 7B to the FP8 data format, which delivers faster inference while keeping output quality on par with the original FP16 model. The authors quantized the model with a pre-release library compatible with the TensorRT-LLM ecosystem and measured significant improvements in latency, throughput, and cost per million tokens. They validated output quality with both a quantitative check (a perplexity benchmark) and qualitative spot checks to confirm that the quantized model meets production requirements. In their benchmarks, FP8 delivered a 33% improvement in speed, an 8.5% decrease in latency, and a 24% reduction in cost per million tokens compared to the FP16 baseline. The authors also explored how batch size and sequence length affect performance and provided guidance on selecting optimal configurations for specific use cases.
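As a rough illustration of the quantitative check described above, the sketch below computes sliding-window perplexity for the FP16 baseline with Hugging Face transformers; the same scoring could then be repeated against the FP8-quantized engine (through the TensorRT-LLM runtime rather than transformers) and the two numbers compared. The checkpoint name, the `eval_corpus.txt` file, and the stride/window settings are assumptions for the example, not the authors' actual evaluation setup.

```python
# Minimal perplexity sketch, assuming a Mistral 7B checkpoint and a held-out text file.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # assumed FP16 baseline checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def perplexity(text: str, stride: int = 512, max_length: int = 4096) -> float:
    """Sliding-window perplexity: exp of the average negative log-likelihood."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, n_scored, prev_end = [], 0, 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + max_length, input_ids.size(1))
        target_len = end - prev_end          # tokens newly scored in this window
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-target_len] = -100       # mask context-only tokens from the loss
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * target_len)
        n_scored += target_len
        prev_end = end
        if end == input_ids.size(1):
            break
    return math.exp(torch.stack(nlls).sum().item() / n_scored)

with open("eval_corpus.txt") as f:          # assumed held-out evaluation corpus
    print(f"FP16 baseline perplexity: {perplexity(f.read()):.2f}")
```

A quantized deployment whose perplexity stays within a small margin of this baseline is one signal, alongside qualitative review of sample generations, that quality has been preserved.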