
Fast and Accurate GPU Quantization for Transformers

Blog post from Speechmatics

Post Details

Company: Speechmatics
Author: Lawrence Atkins
Word Count: 3,633
Language: English
Summary

Deploying state-of-the-art Transformer models cost-effectively is a critical challenge. One popular technique for addressing it is quantization, which aims to increase throughput and decrease memory footprint by reducing the numerical precision of network parameters and activations. However, this can harm model accuracy if not done carefully. To address this, the post explores techniques such as calibration, quantization-aware training, and operator fusion. The authors investigate the nuances of achieving peak GPU performance for INT8 GEMM operations, which are crucial for deploying transformer models efficiently. They also discuss the future of 8-bit quantization with the advent of Nvidia's Hopper/Lovelace architectures, which support a new floating-point datatype, FP8. FP8 has both accuracy and performance benefits, and can potentially reduce or remove the need for quantization-aware training (QAT).
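As a rough illustration of the ideas summarized above (a NumPy sketch, not Speechmatics' actual kernels), symmetric "absmax" INT8 quantization and the int8-multiply/int32-accumulate pattern behind an INT8 GEMM look roughly like this:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization to INT8.

    The scale maps the largest absolute value onto the INT8
    range [-127, 127]; no zero-point is needed for symmetric
    quantization.
    """
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)).astype(np.float32)
b = rng.standard_normal((8, 8)).astype(np.float32)

q_a, s_a = quantize_int8(a)
q_b, s_b = quantize_int8(b)

# INT8 GEMM pattern: multiply int8 operands, accumulate in int32
# (to avoid overflow), then rescale the result back to float
# using the product of the two input scales.
acc = q_a.astype(np.int32) @ q_b.astype(np.int32)
c_approx = acc.astype(np.float32) * (s_a * s_b)

# The quantized product closely tracks the full-precision one.
print(np.abs(c_approx - a @ b).max())
```

On GPU, the inner int8/int32 step maps onto dedicated tensor-core instructions, which is where the throughput gains over FP16/FP32 GEMMs come from; calibration and QAT then exist to choose scales that keep the rounding error here from hurting model accuracy.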