DPO, your simplest RL pipeline with two rollouts
Blog post from Fireworks AI
Fireworks RFT supports fine-tuning large language models (LLMs) with Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), two techniques that improve model responses by comparing preferred and dispreferred outputs. This post walks through a simplified view of GRPO that aligns it closely with DPO, via both theoretical and practical analysis, and shows how that connection enables a continuous training pipeline that mimics a reinforcement learning (RL) loop.

The approach uses a dataset in which each prompt carries two responses, one preferred over the other; the model learns by increasing the probability of the preferred response relative to the dispreferred one. In practice, this fits real-world scenarios such as customer support bots, where user feedback continuously supplies fresh preference pairs and recurring training runs improve the model over time. Fireworks.ai provides tools and APIs for setting up these recurring training workflows, enabling ongoing model enhancement.
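To make the preference-learning objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The function name and arguments are illustrative, not Fireworks API names; each argument is assumed to be the summed token log-probability of a full response under either the training policy or a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, preferred, dispreferred) pair.

    Minimizing this loss raises the policy's probability of the
    preferred response relative to the dispreferred one, anchored
    to a frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin; shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example pair: the policy assigns a higher log-probability to the
# preferred response than the reference does, so the loss is small.
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-20.0,
                ref_logp_chosen=-15.0, ref_logp_rejected=-15.0)
```

At a margin of zero (policy identical to the reference) the loss is log 2; widening the margin in favor of the preferred response drives it toward zero, which is exactly the "increase the probability of preferred responses" behavior described above.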