Deriving the DPO Loss from First Principles
Blog post from Hugging Face
Aayush Garg's article presents Direct Preference Optimization (DPO) as a simpler alternative to Proximal Policy Optimization (PPO) in Reinforcement Learning from Human Feedback (RLHF) for large language models (LLMs). Whereas PPO requires a multi-step pipeline of reward modeling followed by reinforcement learning, DPO aligns an LLM with human preferences directly through a supervised classification loss on preference pairs, with no explicit reward model and no sampling during training. Starting from the Bradley-Terry model of pairwise preferences, the article shows that DPO implicitly optimizes the same objective as PPO-based RLHF, namely reward maximization under a KL-divergence constraint against a reference model, by treating the policy's log-probability ratio with that reference model as an implicit reward. This removes the need for reinforcement learning algorithms and value functions, making the optimization computationally lightweight while preserving the KL-constrained reward-maximization objective.
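The loss the derivation arrives at is the standard DPO objective from Rafailov et al.: for a prompt x with preferred response y_w and dispreferred response y_l,

L_DPO(π_θ; π_ref) = −E[ log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)) ) ],

where β scales the implicit reward and σ is the logistic sigmoid. A minimal PyTorch sketch of this loss follows; the function and argument names are illustrative and not taken from the article, and each argument is assumed to be the summed per-token log-probability of a response under the policy or the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO loss for a batch of preference pairs.

    Inputs are summed log-probabilities of the chosen (preferred) and
    rejected responses under the trainable policy and the frozen
    reference model. Names are hypothetical, not from the article.
    """
    # Implicit rewards: beta times the policy/reference log-ratio.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference likelihood turned into a classification
    # loss: maximize log sigma(reward margin). F.logsigmoid is the
    # numerically stable form of log(sigmoid(.)).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the reference-model log-probabilities would be computed with a frozen copy of the initial model and detached from the computation graph, so gradients flow only through the policy terms.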