DenseR: Dense Rewards For Free in LLM Reasoning
Blog post from HuggingFace
DenseR is a proposed enhancement to Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm commonly used to train reasoning models, that replaces sparse, per-completion rewards with dense, per-token rewards. GRPO assigns the same reward or penalty to every token in a completion, regardless of each token's contribution to the correctness of the result. This is inefficient: correct intermediate steps get penalized alongside mistakes, and novel strategies receive no extra credit.

DenseR addresses this by examining the model's internal representations at each token and using contrastive signals to assign weights to individual tokens, concentrating reward and penalty on the steps that actually matter for the reasoning. The method requires no additional reward models or annotations; it derives the weights directly from the policy model's existing hidden states.

Experimental results show that DenseR significantly improves performance on challenging benchmarks, particularly for smaller models, by promoting diverse correct solutions and strengthening reasoning capabilities without increasing inference cost.
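To make the idea concrete, here is a minimal sketch of how per-token weights could be derived from hidden states and used to redistribute a completion-level advantage. This is an illustration under stated assumptions, not the paper's exact formulation: the function names, the mean-difference contrastive direction, and the softmax normalization are all hypothetical choices for the sketch.

```python
import numpy as np

def per_token_weights(hidden_states, correct_mean, incorrect_mean):
    """Weight each token by how strongly its hidden state aligns with a
    contrastive direction separating correct from incorrect completions.

    hidden_states: (num_tokens, hidden_dim) array for one completion.
    correct_mean / incorrect_mean: (hidden_dim,) mean hidden states pooled
    over correct and incorrect completions (a hypothetical contrast signal).
    """
    # Contrastive direction: from "incorrect" toward "correct" representations.
    direction = correct_mean - incorrect_mean
    direction = direction / np.linalg.norm(direction)
    # Score each token by its projection onto that direction.
    scores = hidden_states @ direction
    # Softmax over tokens -> nonnegative weights that sum to 1.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def dense_advantages(completion_advantage, weights):
    """Redistribute a single sparse advantage across tokens.

    Scaling by len(weights) keeps the mean per-token advantage equal to the
    original completion-level advantage, so the overall update magnitude is
    preserved while critical tokens receive a larger share.
    """
    return completion_advantage * weights * len(weights)
```

In plain GRPO every token would receive `completion_advantage` unchanged; here tokens whose hidden states look more "correct" under the contrastive direction absorb a larger share of the credit or blame, while the average update stays the same.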