
Running fine-tuned models on Workers AI with LoRAs

Blog post from Cloudflare

Post Details

Company: Cloudflare
Date Published:
Author: Michelle Chen, Logan Grasby
Word Count: 2,415
Language: English
Hacker News Points: -
Summary

Inference from fine-tuned LLMs with LoRAs is now in open beta on the Workers AI platform. Low-Rank Adaptation (LoRA) is a fine-tuning method that can be applied to many model architectures, not just LLMs. It keeps the fine-tuned weights separate from the pre-trained model, so the base model itself is never modified. Because the original base weights stay intact, new fine-tune weights can be created with relatively little compute. The additional trained parameters, referred to as a LoRA adapter, are small and therefore easy to distribute, and serving fine-tuned inference with a LoRA adds only milliseconds of latency to total inference time.
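
The summary describes serving a fine-tuned adapter at inference time on Workers AI. As a rough illustration, the sketch below shows a minimal Worker calling the Workers AI binding (env.AI.run); the LoRA-capable model name, the raw and lora options, and the adapter name are illustrative assumptions about the open-beta feature, not details quoted from this page.

```typescript
// Minimal sketch of a Worker running a LoRA-capable base model with a
// previously uploaded fine-tune adapter. Model name, options, and adapter
// name are assumptions for illustration, not verbatim from the post.
export interface Env {
  AI: any; // Workers AI binding (e.g. [ai] binding = "AI" in wrangler.toml)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const result = await env.AI.run(
      "@cf/mistralai/mistral-7b-instruct-v0.2-lora", // assumed LoRA-capable base model
      {
        prompt: "Summarize this support ticket in one sentence: ...",
        raw: true,           // assumption: bypass the default prompt template
        lora: "my-finetune", // hypothetical adapter name/ID uploaded beforehand
      }
    );

    return Response.json(result);
  },
};
```

Because the adapter is referenced by name at request time rather than baked into the model weights, the same base model can serve many different fine-tunes, which is what keeps the added latency to milliseconds.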