40% faster Stable Diffusion XL inference with NVIDIA TensorRT

Company

Baseten

Date Published

Feb. 22, 2024

Author

Pankaj Gupta, Justin Yi, Philip Kiely

Word count

2403

Language

English

Hacker News points

None

URL

www.baseten.co/blog/40-faster-stable-diffusion-xl-inference-with-nvidia-tensorrt

Summary

Stable Diffusion XL (SDXL) is a text-to-image model that generates high-quality images. It is built as a modular pipeline of four major components: CLIP, UNet, Refiner, and VAE. The UNet is the main component and runs iteratively, once per inference step, to build an image representation in latent space. Optimizing SDXL means optimizing each component in the pipeline individually with NVIDIA TensorRT, a software development kit for high-performance deep learning inference. The process involves exporting the model pipeline to ONNX, building an optimized engine for each sub-model within SDXL, and deploying the optimized models as API endpoints. With TensorRT, SDXL achieves up to 40% lower latency and 70% higher throughput than the unoptimized model on the same hardware, making it viable for latency-sensitive and cost-sensitive use cases. The same techniques can be applied to similar image generation pipelines, including SDXL Turbo, a distilled variant of SDXL designed to generate images in far fewer inference steps.
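
As a rough sanity check on the two headline figures, the sketch below relates a latency reduction to the throughput gain it would imply. It assumes requests are served one at a time, so that throughput is simply the inverse of latency; that assumption and the helper name `throughput_gain` are illustrative and do not come from the article, whose measured numbers may reflect other factors.

```python
def throughput_gain(latency_reduction: float) -> float:
    """Fractional throughput gain implied by a fractional latency reduction.

    Assumption (not from the article): requests are served sequentially,
    so throughput = 1 / latency.
    """
    if not 0 <= latency_reduction < 1:
        raise ValueError("latency_reduction must be in [0, 1)")
    # New latency is (1 - r) of the old one, so throughput scales by 1 / (1 - r).
    return 1.0 / (1.0 - latency_reduction) - 1.0


gain = throughput_gain(0.40)
print(f"{gain:.0%} higher throughput")  # prints "67% higher throughput"
```

Under this simplified model, 40% lower latency works out to about 67% higher throughput, which is in line with the roughly 70% figure quoted above.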