Leveraging a large language model (LLM) as a judge can substantially improve policy-model performance in domains that are hard to quantify, such as creative writing. Using the Fireworks Reinforcement Fine-Tuning (RFT) API, the Qwen2.5 32B base model was fine-tuned to a 93.8% win rate on creative writing tasks against its original version. The open-source Qwen3 235B model served as the judge and was deployed with the Fireworks Build SDK, which automatically allocates appropriate compute resources. Rewards were assigned through pairwise comparisons of different rollouts of the same prompt, with the judge assessing dimensions such as style and coherence; a sketch of this scheme appears below. The Arena Hard Auto dataset, which spans creative writing, mathematics, and software engineering tasks, served as the testing ground, and the same RFT methodology also yielded gains in the more objective domains of mathematics and programming. These results highlight the potential of LLM judges for nuanced evaluation of creative tasks, delivering substantial improvements over the base model.
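To make the pairwise-judging scheme concrete, here is a minimal sketch of how such a reward function might be structured. It does not reproduce the actual Fireworks RFT API; `pairwise_rewards`, `JUDGE_PROMPT`, and the `judge` callable are illustrative assumptions, with the judge model stubbed behind a generic text-in, text-out client.

```python
import itertools
from typing import Callable, List

# Hypothetical judge prompt; the real criteria and wording would be tuned for the task.
JUDGE_PROMPT = (
    "You are judging two creative-writing responses to the same prompt.\n"
    "Compare them on style and coherence, then answer with exactly 'A' or 'B'.\n\n"
    "Prompt:\n{prompt}\n\nResponse A:\n{a}\n\nResponse B:\n{b}"
)

def pairwise_rewards(
    prompt: str,
    rollouts: List[str],
    judge: Callable[[str], str],  # hypothetical: sends text to the judge model, returns its reply
) -> List[float]:
    """Score each rollout by its win fraction over the other rollouts of the same prompt."""
    wins = [0] * len(rollouts)
    for i, j in itertools.combinations(range(len(rollouts)), 2):
        # Ask the judge to compare rollout i (shown as "A") against rollout j (shown as "B").
        verdict = judge(JUDGE_PROMPT.format(prompt=prompt, a=rollouts[i], b=rollouts[j]))
        # Treat any verdict that does not start with "A" as a win for B.
        if verdict.strip().startswith("A"):
            wins[i] += 1
        else:
            wins[j] += 1
    n_opponents = max(len(rollouts) - 1, 1)
    # Reward for each rollout = fraction of its pairwise comparisons won, in [0, 1].
    return [w / n_opponents for w in wins]
```

In practice, a setup like this would typically also query the judge with the response order swapped to control for position bias, and the per-rollout win fractions would feed into the RFT training loop as scalar rewards.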