Predibase has developed a new serving infrastructure called LoRA Exchange (LoRAX) to efficiently serve many fine-tuned large language models (LLMs) from shared GPU resources, addressing the cost and resource inefficiency of provisioning a dedicated GPU deployment for each model. LoRAX combines three techniques, sketched below: Dynamic Adapter Loading, which loads fine-tuned adapter weights only when a request needs them; Tiered Weight Caching, which reduces GPU memory usage by offloading adapter weights to CPU memory and disk; and Continuous Multi-Adapter Batching, which sustains throughput by batching requests across multiple models. Together, these techniques let users pack up to 100 specialized models into a single deployment, making serving far more cost-effective than conventional one-deployment-per-model approaches.

LoRAX is integrated with Predibase's platform, which simplifies fine-tuning and deploying models with the open-source Ludwig framework and is available as a free trial. Now open-sourced, LoRAX enables organizations to deploy task-specific LLMs efficiently, leveraging fine-tuning to improve performance on specific applications without the high costs typically associated with serving each model individually.
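To make the first two ideas concrete, here is a minimal sketch of an adapter manager that loads LoRA weights on demand and evicts least-recently-used adapters from the GPU tier to a CPU tier as each fills up. The names (`TieredAdapterCache`, `get`, `_load_from_disk`) and the slot counts are illustrative assumptions, not LoRAX's actual API.

```python
from collections import OrderedDict

class TieredAdapterCache:
    """Illustrative sketch (not the LoRAX API): adapters are loaded on
    demand and evicted GPU -> CPU -> disk as each tier fills up."""

    def __init__(self, gpu_slots=4, cpu_slots=32):
        self.gpu = OrderedDict()   # adapter_id -> weights on GPU (hottest tier)
        self.cpu = OrderedDict()   # adapter_id -> weights in host RAM
        self.gpu_slots = gpu_slots
        self.cpu_slots = cpu_slots

    def get(self, adapter_id):
        # Hit in the GPU tier: mark as most recently used and return.
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]
        # Promote from the CPU tier, or dynamically load from disk on a miss.
        weights = self.cpu.pop(adapter_id, None)
        if weights is None:
            weights = self._load_from_disk(adapter_id)  # cold path
        self._evict_gpu_if_full()
        self.gpu[adapter_id] = weights  # in practice: move tensors onto the GPU
        return weights

    def _evict_gpu_if_full(self):
        while len(self.gpu) >= self.gpu_slots:
            evicted_id, evicted = self.gpu.popitem(last=False)  # LRU out
            if len(self.cpu) >= self.cpu_slots:
                # Dropped here; a real system would write it back to disk.
                self.cpu.popitem(last=False)
            self.cpu[evicted_id] = evicted  # in practice: copy tensors to host

    def _load_from_disk(self, adapter_id):
        # Placeholder: a real system would deserialize LoRA weights here.
        return f"weights-for-{adapter_id}"
```

With `gpu_slots=2`, repeated calls such as `cache.get("support-bot")`, `cache.get("sql-gen")`, `cache.get("summarizer")` would evict the least-recently-used adapter to the CPU tier, and a later `cache.get("support-bot")` would promote it back without touching disk.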
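Continuous Multi-Adapter Batching can be pictured in a similarly simplified way: one scheduling loop drains a shared request queue and groups requests by target adapter so that a single base-model deployment serves all of them. This sketch reuses the hypothetical `TieredAdapterCache` above, and `decode_step` is a stand-in for a forward pass; a production scheduler would interleave token-level decode steps continuously rather than processing whole groups in turn.

```python
from collections import defaultdict, deque

def serve_batches(pending, adapter_cache, max_batch=8):
    """Illustrative sketch (not the LoRAX scheduler): drain one queue of
    (adapter_id, prompt) requests and group them by adapter so many
    fine-tuned models share a single base-model deployment."""
    while pending:
        # Take up to max_batch waiting requests off the shared queue.
        batch = [pending.popleft() for _ in range(min(max_batch, len(pending)))]

        # Group requests by the adapter they target.
        by_adapter = defaultdict(list)
        for adapter_id, prompt in batch:
            by_adapter[adapter_id].append(prompt)

        # Run each group against the shared base model with that group's
        # LoRA weights applied (fetched via the tiered cache above).
        for adapter_id, prompts in by_adapter.items():
            weights = adapter_cache.get(adapter_id)
            decode_step(prompts, weights)

def decode_step(prompts, adapter_weights):
    # Placeholder for one forward pass of the shared base model.
    print(f"decoding {len(prompts)} prompt(s) with {adapter_weights}")

queue = deque([("support-bot", "Hi!"),
               ("sql-gen", "List overdue invoices"),
               ("support-bot", "Where is my order?")])
serve_batches(queue, TieredAdapterCache())
```

The key design point this illustrates is that requests for different fine-tuned models are never routed to separate deployments: they share one queue, one base model, and one pool of GPU memory, which is what makes packing many models into a single deployment economical.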