
Running Llama 3 8B with TensorRT-LLM on Serverless GPUs

Blog post from Cerebrium

Post Details
Company: Cerebrium
Date Published:
Author: Michael Louis
Word Count: 1,410
Language: English
Hacker News Points: -
Summary

The tutorial guides the reader through serving the Llama 3 8B model with the TensorRT-LLM framework on the Cerebrium platform, using TensorRT-LLM's inference optimizations to achieve significant performance gains. The process involves setting up a Cerebrium account, installing the required packages, and writing code to download the model, convert it to the TensorRT-LLM checkpoint format, build the engine, and deploy the application. The resulting deployment reaches roughly 1,700 output tokens per second on a single Nvidia A10 instance, with room for further gains through speculative sampling or FP8 quantization.
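
For illustration, here is a minimal sketch of the download, convert, and build steps the summary describes. It assumes TensorRT-LLM and huggingface_hub are installed, that the `convert_checkpoint.py` script from the TensorRT-LLM repository's `examples/llama/` directory is available locally, and that all paths and the model ID are placeholders; exact flags and script locations vary by TensorRT-LLM version, so this is a sketch rather than the post's exact code.

```python
# Hypothetical sketch: download Llama 3 8B, convert it to TensorRT-LLM
# checkpoint format, and build a serving engine. Paths and the location
# of convert_checkpoint.py are placeholder assumptions.
import subprocess

from huggingface_hub import snapshot_download

# 1. Download the Llama 3 8B Instruct weights (gated repo; requires an
#    authenticated Hugging Face token).
model_dir = snapshot_download(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="./llama-3-8b-hf",
)

# 2. Convert the Hugging Face checkpoint to TensorRT-LLM format using the
#    converter script shipped in the TensorRT-LLM examples.
subprocess.run(
    [
        "python", "convert_checkpoint.py",
        "--model_dir", model_dir,
        "--output_dir", "./tllm-checkpoint",
        "--dtype", "float16",
    ],
    check=True,
)

# 3. Compile the engine with the trtllm-build CLI.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "./tllm-checkpoint",
        "--output_dir", "./tllm-engine",
        "--gemm_plugin", "float16",
    ],
    check=True,
)
```

On a serverless platform like Cerebrium, these steps would typically run during the app's build or first initialization so the engine is compiled once on the target GPU rather than on every request.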