
Fine-tuning open LLM judges to outperform GPT-5.2

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Zain Hasan, Jasmine Li, Ivan Provilkov
Word Count: 2,468
Language: English
Hacker News Points: -
Summary

The post describes using preference optimization to train open-source large language models (LLMs) that outperform GPT-5.2 at aligning with human preferences, as measured on the RewardBench 2 benchmark. It shows that open models such as GPT-OSS 120B and Qwen3 235B can be fine-tuned to match or surpass GPT-5.2 with Direct Preference Optimization (DPO), a method that trains a model directly on pairs of preferred and dispreferred responses. The work is framed through the LLM-as-a-judge paradigm, in which an LLM evaluates other LLMs' outputs via simpler classification tasks, such as deciding which of two responses is better or whether a text contains harmful content. In the experiments, Qwen3 235B outperforms GPT-5.2 even without tuning, while GPT-OSS 120B improves substantially after fine-tuning, particularly on math and subjective response quality. The post concludes that open-source models are cost-effective and flexible, offering transparency and lower costs than closed-source alternatives like GPT-5.2, which makes them a promising option for production evaluation systems.
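
Since the summary turns on DPO, here is a minimal sketch of the standard DPO objective over preference pairs, written in plain PyTorch. The function name, the beta value, and the toy log-probabilities are illustrative assumptions, not details from the post:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each tensor holds the summed log-probability that the trainable policy
    or the frozen reference model assigns to the chosen / rejected response.
    """
    # How far the policy has moved from the reference on each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes the margin between chosen and rejected log-ratios,
    # scaled by beta, through a logistic loss.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage with made-up numbers: the policy already slightly prefers
# the chosen response, so the loss is moderate and shrinks as the margin grows.
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
print(loss.item())
```

The implicit reward here is the beta-scaled log-ratio between policy and reference, which is why DPO needs only preference pairs and no separately trained reward model.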
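The LLM-as-a-judge setup the post relies on reduces evaluation to a classification call: given a question and two candidate responses, the judge returns which one is better. Below is a hedged sketch of such a pairwise judge against an OpenAI-compatible endpoint; the base URL, environment variable, model identifier, and prompt wording are all assumptions for illustration, not the authors' actual configuration:

```python
import os
from openai import OpenAI

# Hypothetical judge client; base URL, env var, and model ID are assumptions.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ.get("TOGETHER_API_KEY", ""),
)

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly one letter, "A" or "B", for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
"""

def judge(question: str, answer_a: str, answer_b: str,
          model: str = "openai/gpt-oss-120b") -> str:
    """Pairwise LLM-as-a-judge: returns "A" or "B"."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,  # deterministic verdicts for evaluation
        max_tokens=1,     # force a single-letter answer
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(judge("What is 2 + 2?", "2 + 2 = 4.", "2 + 2 = 5."))  # expected: A
```

Constraining the judge to a one-token verdict is what makes the task a simple classification problem, and it is also what makes the judge's outputs easy to turn into the preference pairs that DPO consumes.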