DenseR: Dense Rewards For Free in LLM Reasoning
Blog post from HuggingFace
DenseR is a proposed enhancement to Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm commonly used to train reasoning models, that replaces sparse, per-completion rewards with dense, per-token rewards. GRPO assigns the same reward or penalty to every token in a completion, regardless of each token's contribution to the correctness of the result. This is inefficient: correct intermediate steps get penalized alongside mistakes, and novel strategies receive no extra credit.

DenseR addresses this by examining the model's internal representations at each token and using contrastive signals to assign weights to individual tokens, concentrating reward and penalty on the steps that actually matter for the reasoning. The method requires no additional reward models or annotations; it derives the weights directly from the policy model's existing hidden states.

Experimental results show that DenseR significantly improves performance on challenging benchmarks, particularly for smaller models, by promoting diverse correct solutions and strengthening reasoning capabilities without increasing inference cost.
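To make the idea concrete, here is a minimal sketch of how per-token weights could be derived from hidden states and used to redistribute a completion-level advantage. This is an illustration under stated assumptions, not the paper's exact formulation: the function names, the mean-difference contrastive direction, and the softmax normalization are all hypothetical choices for the sketch.

```python
import numpy as np

def per_token_weights(hidden_states, correct_mean, incorrect_mean):
    """Weight each token by how strongly its hidden state aligns with a
    contrastive direction separating correct from incorrect completions.

    hidden_states: (num_tokens, hidden_dim) array for one completion.
    correct_mean / incorrect_mean: (hidden_dim,) mean hidden states pooled
    over correct and incorrect completions (a hypothetical contrast signal).
    """
    # Contrastive direction: from "incorrect" toward "correct" representations.
    direction = correct_mean - incorrect_mean
    direction = direction / np.linalg.norm(direction)
    # Score each token by its projection onto that direction.
    scores = hidden_states @ direction
    # Softmax over tokens -> nonnegative weights that sum to 1.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def dense_advantages(completion_advantage, weights):
    """Redistribute a single sparse advantage across tokens.

    Scaling by len(weights) keeps the mean per-token advantage equal to the
    original completion-level advantage, so the overall update magnitude is
    preserved while critical tokens receive a larger share.
    """
    return completion_advantage * weights * len(weights)
```

In plain GRPO every token would receive `completion_advantage` unchanged; here tokens whose hidden states look more "correct" under the contrastive direction absorb a larger share of the credit or blame, while the average update stays the same.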