Llama 3.1 is Meta's latest family of large language models and is fast becoming the standard in the open-source LLM space. It ships in three sizes (8B, 70B, and 405B parameters), each with a fine-tuned Instruct variant optimized for following instructions and holding dialogue.

Serving Llama 3.1 as an API requires significant compute, especially for the 405B model, but it can be done on Modal's serverless compute platform using GPUs like A100s and H100s while paying only for what you use. The workflow is straightforward: create a Modal account, clone the examples repo, and adjust the GPU and VRAM settings to match the model size you're serving. Pricing is usage-based, and in production deployments automatically scale up under load and spin down to zero when idle.

Llama 3.1 also comes with a generous community license, making it a strong choice both for fine-tuning and for serving as part of a commercial product. Combined with the open-source serving framework vLLM, Modal makes it easy to build a production-grade Llama 3.1 inference API at a cost-effective price point.
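To make that workflow concrete, here is a minimal sketch of what such a deployment can look like: a Modal function that launches vLLM's OpenAI-compatible server behind a Modal web endpoint. The model choice, GPU type, and secret name (`huggingface-secret`) are illustrative assumptions rather than the exact settings from Modal's examples repo; adjust them for the model size you serve.

```python
import modal

app = modal.App("llama31-vllm-api")

# Container image with vLLM and the Hugging Face hub client for weight downloads.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm", "huggingface_hub"
)

# Illustrative model choice: the 8B Instruct variant fits on a single 80 GB GPU
# at 16-bit precision. The 70B and 405B variants need multiple GPUs
# (e.g. gpu="H100:8") plus tensor parallelism flags on the vLLM server.
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


@app.function(
    image=image,
    gpu="A100-80GB",
    # Assumes you've stored a Hugging Face token as a Modal secret,
    # since the Llama 3.1 weights are gated.
    secrets=[modal.Secret.from_name("huggingface-secret")],
)
@modal.web_server(port=8000, startup_timeout=600)
def serve():
    import subprocess

    # Start vLLM's OpenAI-compatible server in the background; Modal proxies
    # port 8000 to a public HTTPS URL and scales the container to zero when idle.
    subprocess.Popen(
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {MODEL} --host 0.0.0.0 --port 8000",
        shell=True,
    )
```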
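Once deployed with `modal deploy`, the endpoint speaks the OpenAI chat completions protocol, so any OpenAI-compatible client can call it. The base URL below is a placeholder; Modal prints the real URL for your workspace at deploy time.

```python
from openai import OpenAI

# Placeholder URL: substitute the one `modal deploy` prints for your workspace.
client = OpenAI(
    base_url="https://your-workspace--llama31-vllm-api-serve.modal.run/v1",
    api_key="not-needed",  # vLLM accepts any key unless you configure auth
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Summarize the Llama 3.1 license in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

Because the container spins down when idle, the first request after a quiet period pays a cold-start penalty while the weights load; after that, requests hit the warm server directly.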