Company
-
Date Published
-
Author
-
Word count
1655
Language
English
Hacker News points
None

Summary

The blog post discusses the challenges of, and solutions for, reducing cold start times of large language models (LLMs) on Kubernetes, focusing on a Llama 3.1 8B container that initially took 10 minutes to start. It explains why fast cold starts matter for dynamic scaling and cost efficiency, then details a strategy that achieved a 25x improvement in startup speed by reengineering two stages: pulling the container image and loading model weights into GPU memory. The key innovations are serving container images from object storage for faster retrieval, mounting them with FUSE to bypass the extraction phase entirely, and streaming model weights directly into GPU memory with a zero-copy loader. Together, these optimizations mitigate the three main bottlenecks: container registry throughput, storage-driver extraction overhead, and model-loading inefficiency. The post argues that such improvements translate into lower infrastructure costs, greater deployment flexibility, and a better user experience for businesses running LLMs.
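
The post itself doesn't include source code, but the FUSE idea can be sketched in Python with the fusepy bindings (the object-storage URL, filename, and mountpoint below are hypothetical): rather than pulling and extracting image layers up front, a read-only filesystem answers each read() with an HTTP Range request against object storage, so the container sees its files immediately and transfers only the bytes it actually touches.

```python
import errno
import stat

import requests
from fuse import FUSE, FuseOSError, Operations  # fusepy; needs libfuse


class RemoteBlobFS(Operations):
    """Expose one remote object as a local read-only file, lazily.

    Nothing is downloaded or extracted up front: every read() becomes an
    HTTP Range request, which is the essence of using FUSE to skip the
    image-extraction phase.
    """

    def __init__(self, url, filename="model.bin"):
        self.url = url
        self.filename = "/" + filename
        head = requests.head(url, allow_redirects=True)
        head.raise_for_status()
        self.size = int(head.headers["Content-Length"])

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        if path == self.filename:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                    "st_size": self.size}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", self.filename.lstrip("/")]

    def read(self, path, size, offset, fh):
        if path != self.filename:
            raise FuseOSError(errno.ENOENT)
        end = min(offset + size, self.size) - 1
        resp = requests.get(self.url,
                            headers={"Range": f"bytes={offset}-{end}"})
        resp.raise_for_status()
        return resp.content


if __name__ == "__main__":
    # Hypothetical bucket URL and mountpoint.
    FUSE(RemoteBlobFS("https://storage.example.com/llama-3.1-8b/model.bin"),
         "/mnt/model", foreground=True, ro=True)
```

Real implementations mount entire OCI image layers this way and add caching, but the lazy byte-range pattern is the same.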
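
Likewise, the stream-based weight loading could look something like the sketch below, assuming the weights sit as a raw contiguous blob in object storage (the URL, offsets, and helper name are made up, and strictly speaking one pinned-memory hop remains here; the post's "zero-copy" loader may go further). Each HTTP chunk lands in a pinned staging buffer and is copied asynchronously into a preallocated CUDA tensor, so the weights never touch local disk and no intermediate file is deserialized.

```python
import requests
import torch


def stream_weights_to_gpu(url, offset, nbytes, shape,
                          dtype=torch.float16, chunk_size=64 << 20):
    """Stream one tensor's bytes from object storage straight onto the GPU.

    Double-buffered pinned staging lets the next network read overlap the
    previous host-to-device DMA; nothing is ever written to disk.
    """
    gpu = torch.empty(shape, dtype=dtype, device="cuda")
    assert nbytes == gpu.numel() * gpu.element_size()
    dest = gpu.view(torch.uint8).view(-1)  # byte view of the GPU tensor

    staging = [torch.empty(chunk_size, dtype=torch.uint8, pin_memory=True)
               for _ in range(2)]
    done = [torch.cuda.Event(), torch.cuda.Event()]

    resp = requests.get(
        url, stream=True,
        headers={"Range": f"bytes={offset}-{offset + nbytes - 1}"})
    resp.raise_for_status()

    written = 0
    for i, chunk in enumerate(resp.iter_content(chunk_size=chunk_size)):
        buf, ev = staging[i % 2], done[i % 2]
        ev.synchronize()  # don't overwrite a buffer still being DMA'd
        n = len(chunk)
        buf[:n] = torch.frombuffer(bytearray(chunk), dtype=torch.uint8)
        dest[written:written + n].copy_(buf[:n], non_blocking=True)
        ev.record()
        written += n
    torch.cuda.synchronize()  # wait for the last copies to land
    return gpu
```

Loading a full checkpoint is then just a loop over per-tensor (offset, size, shape) metadata, pulling each tensor via range requests instead of reading a file that first had to be downloaded and unpacked.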