Introducing Direct Preference Optimization (DPO) Support on OpenPipe
Blog post from OpenPipe
OpenPipe has introduced support for Direct Preference Optimization (DPO), a fine-tuning method that lets users align models more closely with their specific requirements. Rather than learning only from example completions, DPO trains the model directly on preference data (pairs of preferred and non-preferred responses), which makes it a good fit whenever users already have, or can collect, a source of such preferences. The technique is particularly effective when combined with user-defined criteria, and initial tests have shown promising results, such as a 77% reduction in responses exceeding word limits and a 76% reduction in hallucinated information.

To get started with DPO on OpenPipe, users prepare their preference data, upload it to the platform, select the DPO option when configuring a fine-tuning job, and launch the training run (a rough sketch of preparing such a dataset is shown below). The company is also working on integrating DPO into an online learning workflow to enable continual learning.
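As a minimal illustration of the data-preparation step, the snippet below sketches one way a preference dataset could be assembled and written out as JSONL before upload. The field names used here (`messages`, `preferred_output`, `non_preferred_output`) and the overall layout are assumptions for the sake of the example, not OpenPipe's documented schema; check the platform's dataset documentation for the exact format it expects.

```python
import json

# Sketch only: the field names below are illustrative assumptions,
# not OpenPipe's documented upload schema.
preference_rows = [
    {
        # The shared prompt/conversation both completions respond to.
        "messages": [
            {"role": "system", "content": "Summarize the article in under 50 words."},
            {"role": "user", "content": "<article text>"},
        ],
        # The completion the model should be steered toward
        # (e.g. a summary that respects the word limit).
        "preferred_output": {"role": "assistant", "content": "<concise 45-word summary>"},
        # The completion the model should be steered away from
        # (e.g. an overlong or hallucination-prone summary).
        "non_preferred_output": {"role": "assistant", "content": "<rambling 120-word summary>"},
    },
]

# Write one JSON object per line (JSONL), ready to upload as a dataset.
with open("dpo_preferences.jsonl", "w") as f:
    for row in preference_rows:
        f.write(json.dumps(row) + "\n")
```

Once a dataset like this is uploaded, it can be selected when configuring a fine-tuning job with the DPO option enabled, and the training run launched from there.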