Run Larger LLMs on Runpod Serverless Than Ever Before: Llama-3 70B (and beyond!)
Blog post from RunPod
Runpod's serverless offering now supports multiple GPUs per worker, making it easier to run large language models (LLMs). Users can assign two A100 or H100 GPUs, or up to ten 24GB or 48GB GPUs, to a single worker. That is enough to run 70-billion-parameter models at full precision, or nearly any quantized model, using the vLLM Quick Deploy template. Setup involves creating a network volume to store the model, which reduces cold start times to approximately 600ms for models like Llama-3-70b. A serverless deployment requires more initial setup than a fixed pod, but it is more cost-efficient because it bills only for active use, and it scales dynamically to handle concurrent requests, which gives users a smoother experience.
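
To see why two large GPUs are enough for a 70B model, here is a rough back-of-the-envelope sketch of the weight memory involved. It rests on several assumptions that are not stated in the post: that "full precision" means 16-bit weights (2 bytes per parameter), that the A100 and H100 cards are the 80 GB variants, and that 1 GB is 1e9 bytes. The function names `weight_memory_gb` and `weights_fit` are hypothetical helpers, not part of any Runpod or vLLM API. The estimate covers weights only and ignores the KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weights_fit(num_params: float, bytes_per_param: float,
                gpu_count: int, gpu_memory_gb: float) -> bool:
    """True if the weights fit in the combined GPU memory of one worker.

    Ignores KV cache, activations, and runtime overhead, so a True result
    is a necessary condition for the model to run, not a guarantee.
    """
    total_gpu_memory_gb = gpu_count * gpu_memory_gb
    return weight_memory_gb(num_params, bytes_per_param) <= total_gpu_memory_gb


LLAMA_3_70B_PARAMS = 70e9  # nominal parameter count

# 16-bit weights (assumed meaning of "full precision"): 2 bytes per parameter.
fp16_gb = weight_memory_gb(LLAMA_3_70B_PARAMS, 2)

# 4-bit quantized weights: 0.5 bytes per parameter.
int4_gb = weight_memory_gb(LLAMA_3_70B_PARAMS, 0.5)

# Worker shapes from the post; the 80 GB figure is an assumption.
fits_two_80gb = weights_fit(LLAMA_3_70B_PARAMS, 2, gpu_count=2, gpu_memory_gb=80)
fits_one_80gb = weights_fit(LLAMA_3_70B_PARAMS, 2, gpu_count=1, gpu_memory_gb=80)
fits_ten_24gb = weights_fit(LLAMA_3_70B_PARAMS, 2, gpu_count=10, gpu_memory_gb=24)
fits_two_24gb_int4 = weights_fit(LLAMA_3_70B_PARAMS, 0.5, gpu_count=2, gpu_memory_gb=24)
```

Under these assumptions the 16-bit weights take about 140 GB, which exceeds a single 80 GB card but fits within two of them (160 GB) or ten 24 GB cards (240 GB). A 4-bit quantized copy takes roughly 35 GB, which is why quantized models can run on far smaller workers.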