Content Deep Dive
S-LoRA: Serving Thousands of Models From One GPU for Fun and Profit
Blog post from OpenPipe
Post Details
Company: OpenPipe
Date Published:
Author: Kyle Corbitt
Word Count: 793
Language: English
Hacker News Points: 1
Source URL:
Summary
S-LoRA is a serving optimization that makes it practical to run thousands of fine-tuned large language model (LLM) variants simultaneously on a single GPU, addressing the cold-start latency incurred when an infrequently used model must be loaded into GPU memory on demand. Because parameter-efficient fine-tuning methods like LoRA modify only a small fraction of a model's weights, each fine-tuned variant can be stored as a lightweight adapter rather than a full copy of the model, sharply reducing the memory and compute required per model. Using a tiered caching architecture and custom CUDA kernels, S-LoRA serves many adapters of the same base model from one GPU, improving throughput and reducing response times, which makes deploying large fleets of small, task-specific specialist models economically viable.
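To make the tiered-caching idea concrete, here is a minimal sketch of an LRU adapter cache. This is not OpenPipe's or the S-LoRA authors' actual implementation: the `TieredAdapterCache` class, its `gpu_capacity` parameter, and the example adapter IDs are hypothetical, and the snippet assumes PyTorch and a CUDA-capable GPU.

```python
# Minimal sketch (not S-LoRA's real code) of tiered adapter caching:
# adapters live in host RAM and are paged onto the GPU on demand,
# with least-recently-used (LRU) eviction when the GPU tier is full.
import torch
from collections import OrderedDict


class TieredAdapterCache:
    """Keep a bounded set of LoRA adapters resident on the GPU,
    paging the rest in from host RAM on first use."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity  # max adapters resident on the GPU
        self.gpu_cache: OrderedDict[str, dict] = OrderedDict()  # LRU order
        self.host_store: dict[str, dict] = {}  # adapter_id -> CPU tensors

    def register(self, adapter_id: str, weights: dict) -> None:
        # Only the small low-rank LoRA matrices are stored per adapter;
        # the full base model is shared by every adapter.
        self.host_store[adapter_id] = {k: t.cpu() for k, t in weights.items()}

    def get(self, adapter_id: str) -> dict:
        if adapter_id in self.gpu_cache:
            self.gpu_cache.move_to_end(adapter_id)  # mark most recently used
            return self.gpu_cache[adapter_id]
        # Cold path: evict the least-recently-used adapter if at capacity,
        # then copy this adapter's weights host-to-device.
        if len(self.gpu_cache) >= self.gpu_capacity:
            self.gpu_cache.popitem(last=False)
        weights = {k: t.to("cuda", non_blocking=True)
                   for k, t in self.host_store[adapter_id].items()}
        self.gpu_cache[adapter_id] = weights
        return weights


# Example: a rank-16 adapter for one layer of a 4096-dim model.
cache = TieredAdapterCache(gpu_capacity=100)
cache.register("customer-42", {"lora_A": torch.randn(16, 4096),
                               "lora_B": torch.randn(4096, 16)})
adapter = cache.get("customer-42")  # paged onto the GPU on first use
```

The design point this mirrors from the post is that the heavyweight base model stays resident on the GPU while only the small per-adapter matrices move between tiers, so a cache miss costs a brief host-to-device copy rather than a full model load.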