
Running fine-tuned models on Workers AI with LoRAs

What's this blog post about?

Inference from fine-tuned LLMs with LoRAs is now in open beta on the Workers AI platform. Low-Rank Adaptation (LoRA) is a fine-tuning method that can be applied to various model architectures, not just LLMs. It keeps the fine-tuned weights separate from the pre-trained model, so the base model itself remains unchanged. Because the original base weights are frozen, training the additional parameters, referred to as a LoRA adapter, takes far less time and compute than full fine-tuning. The resulting adapter is small and easy to distribute, and serving fine-tuned inference with a LoRA adds only milliseconds of latency to total inference time.
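
To illustrate the mechanics, here is a minimal sketch in Python/NumPy (not from the original post) of how a LoRA adapter relates to a frozen base layer: the pre-trained weight matrix W is never modified, and the adapter is just a pair of small low-rank matrices B and A whose product is added to the layer's output at inference time. The dimensions, rank r, and scaling factor alpha below are illustrative assumptions.

import numpy as np

# Frozen pre-trained weight matrix: stays unchanged on disk and in memory.
d_out, d_in = 4096, 4096
W = np.random.randn(d_out, d_in).astype(np.float32)

# LoRA adapter: two low-rank matrices. At rank r = 8 the adapter holds
# 2 * 4096 * 8 = 65,536 parameters versus ~16.8M in W itself (~0.4%).
r, alpha = 8, 16
A = np.random.randn(r, d_in).astype(np.float32) * 0.01  # trainable
B = np.zeros((d_out, r), dtype=np.float32)              # trainable, zero-init

def forward(x):
    # Base layer plus low-rank update: y = W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d_in).astype(np.float32)
y = forward(x)  # identical to W @ x until B is trained away from zero

Because only A and B are updated during fine-tuning, an adapter is cheap to train and distribute, and the extra low-rank matrix multiplications at serving time are small, which is where the milliseconds of added inference latency come from.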

Company
Cloudflare

Date published
April 2, 2024

Author(s)
Michelle Chen, Logan Grasby

Word count
2415

Hacker News points
None found.

Language
English
