Llama 3.1 is Meta's latest family of large language models and is fast becoming the standard in the open-source LLM space. It ships in three sizes (8B, 70B, and 405B parameters), each with a fine-tuned Instruct variant optimized for following instructions and holding dialogue.

Serving Llama 3.1 as an API requires significant compute, especially for the 405B model, but it can be done on Modal's serverless compute platform using GPUs like A100s and H100s while paying only for what you use. The workflow is straightforward: create a Modal account, clone the examples repo, and adjust the GPU and VRAM settings to match the model size you're serving. Pricing is usage-based, and in production deployments automatically scale up under load and spin down to zero when idle.

Llama 3.1 also comes with a generous community license, making it a strong choice both for fine-tuning and for serving as part of a commercial product. Combined with the open-source serving framework vLLM, Modal makes it easy to build a production-grade Llama 3.1 inference API at a cost-effective price point.
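To make that workflow concrete, here is a minimal sketch of what such a deployment can look like: a Modal function that launches vLLM's OpenAI-compatible server behind a Modal web endpoint. The model choice, GPU type, and secret name (`huggingface-secret`) are illustrative assumptions rather than the exact settings from Modal's examples repo; adjust them for the model size you serve.

```python
import modal

app = modal.App("llama31-vllm-api")

# Container image with vLLM and the Hugging Face hub client for weight downloads.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm", "huggingface_hub"
)

# Illustrative model choice: the 8B Instruct variant fits on a single 80 GB GPU
# at 16-bit precision. The 70B and 405B variants need multiple GPUs
# (e.g. gpu="H100:8") plus tensor parallelism flags on the vLLM server.
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


@app.function(
    image=image,
    gpu="A100-80GB",
    # Assumes you've stored a Hugging Face token as a Modal secret,
    # since the Llama 3.1 weights are gated.
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
@modal.web_server(port=8000, startup_timeout=600)
def serve():
    import subprocess

    # Start vLLM's OpenAI-compatible server in the background; Modal proxies
    # port 8000 to a public HTTPS URL and scales the container to zero when idle.
    subprocess.Popen(
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {MODEL} --host 0.0.0.0 --port 8000",
        shell=True,
    )
```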
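Once deployed with `modal deploy`, the endpoint speaks the OpenAI chat completions protocol, so any OpenAI-compatible client can call it. The base URL below is a placeholder; Modal prints the real URL for your workspace at deploy time.

```python
from openai import OpenAI

# Placeholder URL: substitute the one `modal deploy` prints for your workspace.
client = OpenAI(
    base_url="https://your-workspace--llama31-vllm-api-serve.modal.run/v1",
    api_key="not-needed",  # vLLM accepts any key unless you configure auth
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the Llama 3.1 license in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

Because the container spins down when idle, the first request after a quiet period pays a cold-start penalty while the weights load; after that, requests hit the warm server directly.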