
Fine-tuning open LLM judges to outperform GPT-5.2

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Zain Hasan, Jasmine Li, Ivan Provilkov
Word Count: 2,468
Language: English
Hacker News Points: -
Summary

The post describes using preference optimization to train open-source large language models (LLMs) that outperform GPT-5.2 at aligning with human preferences, as measured on the RewardBench 2 benchmark. It shows that open models such as GPT-OSS 120B and Qwen3 235B can be fine-tuned to match or surpass GPT-5.2 with Direct Preference Optimization (DPO), a method that trains a model directly on pairs of preferred and dispreferred responses. The work is framed through the LLM-as-a-judge paradigm, in which an LLM evaluates other LLMs' outputs via simpler classification tasks, such as deciding which of two responses is better or whether a text contains harmful content. In the experiments, Qwen3 235B outperforms GPT-5.2 even without tuning, while GPT-OSS 120B improves substantially after fine-tuning, particularly on math and subjective response quality. The post concludes that open-source models are cost-effective and flexible, offering transparency and lower costs than closed-source alternatives like GPT-5.2, which makes them a promising option for production evaluation systems.
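
Since the summary turns on DPO, here is a minimal sketch of the standard DPO objective over preference pairs, written in plain PyTorch. The function name, the beta value, and the toy log-probabilities are illustrative assumptions, not details from the post:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each tensor holds the summed log-probability that the trainable policy
    or the frozen reference model assigns to the chosen / rejected response.
    """
    # How far the policy has moved from the reference on each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes the margin between chosen and rejected log-ratios,
    # scaled by beta, through a logistic loss.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage with made-up numbers: the policy already slightly prefers
# the chosen response, so the loss is moderate and shrinks as the margin grows.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
print(loss.item())
```

The implicit reward here is the beta-scaled log-ratio between policy and reference, which is why DPO needs only preference pairs and no separately trained reward model.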
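The LLM-as-a-judge setup the post relies on reduces evaluation to a classification call: given a question and two candidate responses, the judge returns which one is better. Below is a hedged sketch of such a pairwise judge against an OpenAI-compatible endpoint; the base URL, environment variable, model identifier, and prompt wording are all assumptions for illustration, not the authors' actual configuration:

```python
import os
from openai import OpenAI

# Hypothetical judge client; base URL, env var, and model ID are assumptions.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ.get("TOGETHER_API_KEY", ""),
)

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly one letter, "A" or "B", for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
"""

def judge(question: str, answer_a: str, answer_b: str,
          model: str = "openai/gpt-oss-120b") -> str:
    """Pairwise LLM-as-a-judge: returns "A" or "B"."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,  # deterministic verdicts for evaluation
        max_tokens=1,     # force a single-letter answer
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(judge("What is 2 + 2?", "2 + 2 = 4.", "2 + 2 = 5."))  # expected: A
```

Constraining the judge to a one-token verdict is what makes the task a simple classification problem, and it is also what makes the judge's outputs easy to turn into the preference pairs that DPO consumes.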