Company:
Date Published:
Author: Pankaj Gupta, Philip Kiely
Word count: 1623
Language: English
Hacker News points: None

Summary

NVIDIA's H100 GPU is the most powerful processor on the market, but running inference on ML models takes more than raw power. To get the fastest time to first token, the highest tokens per second, and the lowest total generation time for LLMs and models like Stable Diffusion XL, developers turn to TensorRT, NVIDIA's SDK for optimized model inference, and TensorRT-LLM for large language models. By serving TensorRT-optimized models on H100 GPUs, developers unlock substantial cost savings over A100 workloads along with strong latency and throughput benchmarks. The H100's higher memory bandwidth has a direct impact on LLM performance, yielding double the throughput of the A100 and a 2x improvement in latency at constant batch size for Mistral 7B. The H100 also delivers a faster time to first token thanks to greater Tensor Core compute, and 3x better throughput at larger batch sizes, making it an attractive option for demanding ML workloads. On paper, the H100 offers 1.6x more memory bandwidth than the A100 and 989.5 teraFLOPS of FP16 Tensor Core compute, but running inference with TensorRT/TensorRT-LLM yields even bigger improvements over the A100 than the spec sheet would suggest.
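
As a rough illustration of the metrics discussed above, the sketch below measures time to first token, tokens per second, and total generation time against a streaming endpoint. The `stream_tokens` generator is a hypothetical placeholder, not a TensorRT or TensorRT-LLM API; in practice it would wrap whatever streaming client serves the optimized model.

```python
import time
from typing import Dict, Iterator, Optional


def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical streaming generator; replace with a real client call
    against a TensorRT-LLM-backed server."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)  # simulate network + generation latency
        yield token


def benchmark(prompt: str) -> Dict[str, Optional[float]]:
    start = time.perf_counter()
    first_token_time: Optional[float] = None
    num_tokens = 0

    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()  # marks time to first token
        num_tokens += 1

    end = time.perf_counter()
    decode_time = end - first_token_time if first_token_time else 0.0

    return {
        "time_to_first_token_s": (first_token_time - start) if first_token_time else None,
        "total_generation_time_s": end - start,
        # Per-request tokens/second over the decode phase (after the first token).
        "tokens_per_second": (num_tokens - 1) / decode_time if decode_time > 0 else None,
    }


if __name__ == "__main__":
    print(benchmark("Summarize the benefits of H100 GPUs for LLM inference."))
```

Comparing these per-request numbers across A100 and H100 deployments of the same TensorRT-optimized model, at matched batch sizes, is how the latency and throughput gains described above are typically quantified.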