Direct Preference Optimization with Synthetic Data on Anyscale

Post Details

Company: Anyscale
Date Published:
Author: Franklin Wang, Sumanth Hegde, Kourosh Hakhamaneshi
Word Count: 9,249
Language: English
Hacker News Points: 1
Summary

In this post, we explore preference tuning of LLMs through a practical case study on summarization, using Ray and Anyscale as our compute platform. We applied Direct Preference Optimization (DPO) to Mistral-7B-Instruct-v0.1 to produce high-quality summaries of CNN articles. Our results show that DPO is effective for domains like summarization, where there is no single ground-truth response, and that it achieves much higher win rates than either supervised fine-tuning or prompting GPT-4o. We also found that both β and the learning rate are critical to performance and may require a thorough hyperparameter search. Finally, we demonstrated that regenerating the preference training data with the newly trained model and applying additional rounds of DPO yields further performance gains.
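For readers unfamiliar with the β parameter mentioned above, the DPO objective can be sketched in a few lines of PyTorch. This is a minimal illustration of the loss from Rafailov et al. (2023), not the actual training code used in the post, and the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities
    for the chosen / rejected responses, under the policy being trained
    and under the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response in the pair.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # beta scales the implicit reward; the logistic loss pushes the
    # policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```

In this formulation β controls how strongly the policy is allowed to drift from the reference model while fitting the preferences, which is consistent with the post's finding that β and the learning rate interact and benefit from being tuned jointly.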