The Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO), a technique for aligning language models with human preferences to build more helpful and accurate AI assistants. DPO trains directly on preference data (pairs of preferred and rejected responses to a prompt) without an intermediate reward model, making it simpler than traditional approaches such as Reinforcement Learning from Human Feedback (RLHF). The method refines how a model's existing capabilities are expressed, improving generation quality and alignment with human values. DPO is a good fit when prompting alone isn't sufficient, or when it is easier for humans to compare outputs than to write them, making controlled improvements to existing models more efficient. It excels at tasks involving nuanced quality judgments, but is less suitable for tasks with a single, objectively correct answer. To get started with DPO on Together, developers need to tune key hyperparameters like --dpo-beta and monitor training metrics specific to preference optimization.
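As a rough sketch of what launching such a job could look like with the Together Python SDK: the `training_method` and `dpo_beta` keyword arguments below are assumptions mirroring the `--dpo-beta` CLI flag mentioned above, and the model name and file path are placeholders, so check the Together fine-tuning docs for the exact interface.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload a preference dataset: each JSONL row pairs a prompt with a
# preferred ("chosen") and a non-preferred ("rejected") completion.
train_file = client.files.upload(file="preference_data.jsonl")

# Launch a DPO fine-tuning job. Parameter names here are assumptions
# based on the --dpo-beta flag; consult the docs for exact signatures.
job = client.fine_tuning.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder base model
    training_file=train_file.id,
    training_method="dpo",  # assumed name for selecting preference optimization
    dpo_beta=0.1,           # assumed SDK counterpart of --dpo-beta
    n_epochs=1,
)

print(job.id, job.status)
```

The beta hyperparameter controls how strongly the fine-tuned model is allowed to diverge from the reference model: smaller values let it follow the preference data more aggressively, while larger values keep it closer to the original behavior.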