ML Intern Takes Our Post-Training Internship Test
Blog post from HuggingFace
The ML intern replicated a HuggingFace post-training internship test that explores Weighted Best-of-N selection on MATH-500 problems. The exercise samples N solutions per problem from a large language model (LLM), scores each one with a Process Reward Model (PRM), sums the scores of solutions that reach the same final answer, and selects the answer with the highest total. Weighted Best-of-N was more accurate than both greedy decoding and standard Best-of-N selection, and its accuracy improved as more solutions were sampled. A key finding was that weighted selection can overrule a single high-scoring incorrect solution by aggregating evidence across several solutions that agree on the correct answer, as seen in specific number theory problems. The report found the PRM effective at distinguishing correct from incorrect solutions and suggested that accounting for formatting differences between answers could further improve accuracy. The accompanying code was co-authored, with contributions to the pipeline structure, model loading, and voting implementation, and is presented alongside comprehensive results and analysis.
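
To make the selection rule concrete, here is a minimal sketch contrasting standard and weighted Best-of-N. It is not the code from the report: the function names `best_of_n` and `weighted_best_of_n`, the `(final_answer, prm_score)` pair format, and the sample scores are all assumptions made for illustration, and the LLM sampling and PRM scoring steps are left out.

```python
from collections import defaultdict


def best_of_n(candidates):
    """Standard Best-of-N: return the answer of the single highest-scored solution."""
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


def weighted_best_of_n(candidates):
    """Weighted Best-of-N: sum scores per distinct final answer, return the top answer."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical (final_answer, prm_score) pairs for one problem; not data from the report.
samples = [("12", 0.95), ("18", 0.60), ("18", 0.55), ("18", 0.50), ("7", 0.20)]

print(best_of_n(samples))           # "12": one confident solution wins on its own
print(weighted_best_of_n(samples))  # "18": three agreeing solutions total 1.65 > 0.95
```

Note that this sketch groups answers by exact string match, so equivalent answers written differently (for example `1/2` and `0.5`) would be counted separately. That is presumably the kind of formatting difference the report suggests accounting for.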