Benchmarking vLLM, SGLang and TensorRT for Llama 3.1 API
Blog post from Cerebrium
Cerebrium benchmarked the Llama 3.1 70B FP8 model on three popular inference frameworks (vLLM, TensorRT, and SGLang), measuring Time To First Token (TTFT) and throughput on a single H100 GPU. vLLM was the best fit for low-latency applications, with a TTFT of 123 ms, which the post attributes to its "token stream" approach of overlapping computation and communication. TensorRT uses mixed-precision and quantization techniques to run large language models efficiently. SGLang, with its dynamic workload distribution and GPU optimization, led on throughput, reaching 460 tokens per second at a batch size of 64. The takeaway is that the right framework depends on whether latency or throughput matters more for the workload; the post also suggests configurations such as multi-GPU setups for optimal performance.
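To make the two metrics concrete, here is a minimal Python sketch of how TTFT and tokens per second could be computed from any streaming response. It is not Cerebrium's benchmark harness: `measure_stream`, `StreamMetrics`, and `fake_stream` are hypothetical names introduced for illustration, and the fake stream stands in for a real streaming client from vLLM, TensorRT, or SGLang. It also times a single request, whereas the batch-size-64 throughput figure above would need the token counts of all concurrent requests summed over the same wall-clock window.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamMetrics(NamedTuple):
    ttft_s: float        # seconds from request start to the first token
    tokens: int          # total number of tokens received
    tokens_per_s: float  # tokens divided by total wall-clock time


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamMetrics:
    """Consume a token stream and report TTFT and throughput.

    Assumes the stream is lazy, so the request work starts when
    iteration starts (true for generators, an assumption otherwise).
    """
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # the first token marks TTFT
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    rate = count / total if total > 0 else float("inf")
    return StreamMetrics(first_token_at - start, count, rate)


def fake_stream(n_tokens: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a real streaming client (illustration only)."""
    time.sleep(first_delay_s)  # simulated prefill before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    metrics = measure_stream(fake_stream(20, first_delay_s=0.05, gap_s=0.01))
    print(f"TTFT: {metrics.ttft_s * 1000:.0f} ms, "
          f"throughput: {metrics.tokens_per_s:.0f} tokens/s")
```

The `clock` parameter is injectable so the measurement logic can be checked deterministically; in a real run the default `time.perf_counter` is used. Separating the first-token timestamp from the overall rate mirrors the trade-off the benchmark describes: a framework can win on one metric and lose on the other.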