Agentic RL: Token-In, Token-Out Done Right
Blog post from Hugging Face
The article examines the challenges of training large language models (LLMs) with reinforcement learning (RL) in agentic settings and argues for maintaining the Token-In, Token-Out (TITO) invariant: the token IDs the model sampled are the token IDs the trainer sees. Re-tokenizing model outputs breaks this invariant because tokenization is not reversible. Decoding sampled tokens to text and encoding that text again can produce a different token sequence, so the gradient signal is computed on tokens the model never sampled and becomes unreliable. The recommended fix is to never re-encode decoded tokens and instead keep a buffer of the model's sampled tokens, which preserves the structure of the sequence and prevents token drift. The article then discusses how to ensure that chat templates are prefix-preserving for tool messages, a property the training loop needs in order to stay consistent. It contrasts two approaches, a lighter, more generic TITO loop and a heavier, model-specific renderer, each with its own advantages. It concludes that understanding and verifying the prefix-preservation property of a chat template is what allows effective training without re-implementing the templating logic.
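As a minimal sketch of the two ideas summarized above, the following uses a toy greedy tokenizer instead of a real one. Every name in it (`VOCAB`, `encode`, `decode`, `TokenBuffer`, `is_prefix_preserving`, `render_per_message`, `render_joined`) is hypothetical and comes neither from the article nor from any library. The loss mask kept by the buffer is also an assumption about how such a buffer might be used in training.

```python
from typing import Callable, Dict, List

# Toy vocabulary (hypothetical): "ab" exists as a merged token next to "a" and "b".
VOCAB: Dict[str, int] = {"a": 0, "b": 1, "ab": 2, "<tool>": 3, "</tool>": 4}
ID_TO_TOKEN: Dict[int, str] = {i: t for t, i in VOCAB.items()}
Message = Dict[str, str]


def encode(text: str) -> List[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    ids: List[int] = []
    pos = 0
    while pos < len(text):
        match = next((p for p in pieces if text.startswith(p, pos)), None)
        if match is None:
            raise ValueError(f"cannot encode text at position {pos}")
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: List[int]) -> str:
    """Concatenate the token strings for the given IDs."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


class TokenBuffer:
    """Append-only record of token IDs; decoded text is never re-encoded."""

    def __init__(self, prompt_ids: List[int]) -> None:
        self.ids: List[int] = list(prompt_ids)
        # Assumption: 1 marks tokens the model sampled, 0 marks everything else.
        self.loss_mask: List[int] = [0] * len(prompt_ids)

    def append_sampled(self, ids: List[int]) -> None:
        self.ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))

    def append_environment(self, ids: List[int]) -> None:
        self.ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))


def is_prefix_preserving(
    render: Callable[[List[Message]], List[int]],
    messages: List[Message],
    tool_message: Message,
) -> bool:
    """True if rendering with the tool message only appends to the old tokens."""
    before = render(messages)
    after = render(messages + [tool_message])
    return after[: len(before)] == before


def render_per_message(messages: List[Message]) -> List[int]:
    """Toy template that tokenizes each message on its own and wraps tool output."""
    ids: List[int] = []
    for m in messages:
        if m["role"] == "tool":
            ids += [VOCAB["<tool>"]] + encode(m["content"]) + [VOCAB["</tool>"]]
        else:
            ids += encode(m["content"])
    return ids


def render_joined(messages: List[Message]) -> List[int]:
    """Toy template that joins all text first, so merges can cross message boundaries."""
    return encode("".join(m["content"] for m in messages))


sampled = [VOCAB["a"], VOCAB["b"]]     # the model sampled "a" then "b"
retokenized = encode(decode(sampled))  # greedy matching merges them into "ab"
print(sampled, retokenized)            # [0, 1] [2]
```

In this toy setup the round trip through text turns two sampled tokens into one different token, which is the drift the buffer avoids by storing IDs directly. `is_prefix_preserving` returns `True` for `render_per_message` and `False` for `render_joined` when an assistant message `"a"` is followed by a tool message `"b"`, which illustrates the kind of check one might run against a real chat template before relying on it in a training loop.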