Company
Date Published
Author: Justin Yi, Philip Kiely
Word count: 1076
Language: English
Hacker News points: None

Summary

TensorRT is NVIDIA's software development kit for high-performance deep learning inference; it delivers significant performance gains by compiling models and optimizing them at the CUDA level. Using TensorRT in production requires understanding your compute needs and traffic patterns, as well as choosing a supported model and GPU architecture. Optimizing model weights with TensorRT can yield 40% lower latency and 3x higher throughput for large language models like Mixtral 8x7B, with even larger gains on higher-end GPUs like the H100. By working closely with NVIDIA engineers and following best practices, developers can achieve world-class performance on latency- and throughput-sensitive tasks.
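To make the cited gains concrete, here is a back-of-the-envelope sketch applying the summary's claimed improvements (40% lower latency, 3x higher throughput) to a purely hypothetical baseline; the baseline numbers are assumptions for illustration, not measurements from the article.

```python
def apply_trt_gains(base_latency_ms: float, base_throughput_tps: float):
    """Apply the article's claimed TensorRT improvements to a baseline:
    40% lower latency and 3x higher throughput."""
    optimized_latency = base_latency_ms * (1 - 0.40)   # 40% lower latency
    optimized_throughput = base_throughput_tps * 3     # 3x throughput
    return optimized_latency, optimized_throughput

# Hypothetical baseline: 100 ms per request, 500 tokens/second.
latency, throughput = apply_trt_gains(100.0, 500.0)
print(f"latency: {latency:.0f} ms, throughput: {throughput:.0f} tok/s")
# → latency: 60 ms, throughput: 1500 tok/s
```

At this scale, the throughput multiplier dominates: a deployment serving the same traffic would need roughly a third as many GPU replicas, which is where most of the cost savings come from.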