
Run Larger LLMs on Runpod Serverless Than Ever Before – Llama-3 70B (and beyond!)

Blog post from RunPod

Post Details
Company:
Date Published:
Author: Brendan McKeag
Word Count: 639
Language: English
Hacker News Points: -
Summary

Runpod's serverless offering now supports multiple GPUs per worker, which makes it practical to run larger language models (LLMs). Users can assign two A100 or H100 GPUs, or up to ten 24GB or 48GB GPUs, to a single worker. That is enough to run 70-billion-parameter models at full precision, or nearly any quantized model, using the vLLM Quick Deploy template. Setup involves creating a network volume to store the model, which reduces cold start times to approximately 600 ms for models like Llama-3 70B. Serverless requires more initial setup than a fixed pod, but it is more cost-efficient because it bills only for active use, and it scales dynamically to handle concurrent requests, giving users a smoother experience than a fixed pod setup.
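To see why these GPU counts are enough for a 70B model, the arithmetic can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a figure from the post. It assumes that "full precision" means 16-bit weights (2 bytes per parameter), that the A100 and H100 are the 80GB variants, and that about 10% of each GPU is held back for overhead. The `overhead_fraction` parameter and the function names are hypothetical, and the estimate ignores KV cache and activation memory, so real deployments will need more headroom.

```python
import math

# Approximate bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(num_params: float, dtype: str, gpu_memory_gb: float,
             overhead_fraction: float = 0.1) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    overhead_fraction is an assumed reserve per GPU, not a measured value.
    """
    usable_gb = gpu_memory_gb * (1.0 - overhead_fraction)
    return math.ceil(weight_memory_gb(num_params, dtype) / usable_gb)


if __name__ == "__main__":
    params = 70e9  # a 70-billion-parameter model
    print(f"fp16 weights: {weight_memory_gb(params, 'fp16'):.0f} GB")
    for gpu_gb in (80, 48, 24):
        print(f"{gpu_gb} GB GPUs needed (fp16): {min_gpus(params, 'fp16', gpu_gb)}")
    print(f"24 GB GPUs needed (int4): {min_gpus(params, 'int4', 24)}")
```

Under these assumptions, the 16-bit weights come to about 140 GB, which fits on two 80GB GPUs and needs well under ten 24GB or 48GB GPUs. A 4-bit quantized copy shrinks to roughly 35 GB. In practice the count may need rounding up further, since inference engines that split a model across GPUs often constrain which GPU counts are valid.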