The Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO), a technique for aligning language models with human preferences to build more helpful and accurate AI assistants. DPO trains directly on preference data (pairs of preferred and rejected responses to a prompt) without an intermediate reward model, making it simpler than traditional approaches such as Reinforcement Learning from Human Feedback (RLHF). The method refines how a model's existing capabilities are expressed, improving generation quality and alignment with human values. DPO is a good fit when prompting alone isn't sufficient, or when it is easier for humans to compare outputs than to write them, making controlled improvements to existing models more efficient. It excels at tasks involving nuanced quality judgments, but is less suitable for tasks with a single, objectively correct answer. To get started with DPO on Together, developers need to tune key hyperparameters like --dpo-beta and monitor training metrics specific to preference optimization.
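As a rough sketch of what launching such a job could look like with the Together Python SDK: the `training_method` and `dpo_beta` keyword arguments below are assumptions mirroring the `--dpo-beta` CLI flag mentioned above, and the model name and file path are placeholders, so check the Together fine-tuning docs for the exact interface.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload a preference dataset: each JSONL row pairs a prompt with a
# preferred ("chosen") and a non-preferred ("rejected") completion.
train_file = client.files.upload(file="preference_data.jsonl")

# Launch a DPO fine-tuning job. Parameter names here are assumptions
# based on the --dpo-beta flag; consult the docs for exact signatures.
job = client.fine_tuning.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder base model
    training_file=train_file.id,
    training_method="dpo",  # assumed name for selecting preference optimization
    dpo_beta=0.1,           # assumed SDK counterpart of --dpo-beta
    n_epochs=1,
)

print(job.id, job.status)
```

The beta hyperparameter controls how strongly the fine-tuned model is allowed to diverge from the reference model: smaller values let it follow the preference data more aggressively, while larger values keep it closer to the original behavior.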