Home / Companies / HuggingFace / Blog / Post Details
Content Deep Dive

Agentic RL: Token-In, Token-Out Done Right

Blog post from HuggingFace

Post Details
Company
Date Published
Author
Quentin Gallouédec and Kashif Rasul
Word Count
3,670
Language
-
Hacker News Points
-
Summary

The article explores the challenges and solutions associated with training large language models (LLMs) using reinforcement learning (RL), emphasizing the importance of maintaining the Token-In, Token-Out (TITO) invariant. It highlights the pitfalls of re-tokenizing model outputs, which can lead to unreliable gradient signals due to non-reversible tokenization processes. The recommended solution is to avoid re-encoding decoded tokens, using a buffer to keep track of the model's sampled tokens, thus maintaining structural integrity and preventing token drift. The article further discusses methods to ensure chat templates are prefix-preserving for tool messages, which is crucial for maintaining the consistency of the training loop. It contrasts two approaches: a lighter, more generic TITO loop and a heavier model-specific renderer, each with its advantages. The piece concludes by emphasizing the need to understand and verify the prefix-preservation property of chat templates for effective model training without re-implementing templating logic.