vLLM V0 to V1: Correctness Before Corrections in RL
Blog post from HuggingFace
The transition from vLLM V0 to V1 in PipelineRL centered on one principle: verify inference correctness before reaching for reinforcement learning (RL) objective corrections. The migration surfaced discrepancies in how token log probabilities (logprobs) were computed, which directly affect training dynamics. Four issues were addressed: logprob semantics, runtime defaults, inflight weight updates, and the precision of the final projection, with each fix aligning V1's behavior with the V0 reference. Once these backend fixes were in place, the V1 engine returned logprobs and runtime behavior that matched trainer expectations, removing the temptation to paper over backend bugs with premature objective-side corrections that would have obscured training outcomes. The takeaway: confirm backend correctness first, then pursue RL objective improvements.
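To make this discipline concrete, here is a minimal sketch, assuming a PyTorch-based trainer, of the kind of parity check the post implies: compare the per-token logprobs the inference engine returns against logprobs the trainer recomputes on the same tokens, and refuse to train (or to add objective-side corrections) until they agree. The function name, tensor layout, and tolerance are illustrative assumptions, not PipelineRL's actual API.

```python
import torch

def check_logprob_agreement(sampler_logprobs: torch.Tensor,
                            trainer_logprobs: torch.Tensor,
                            atol: float = 1e-2) -> dict:
    """Compare per-token logprobs from the inference engine against
    logprobs recomputed by the trainer on the same token sequence.

    A large gap here points to a backend bug (logprob semantics,
    final-projection precision, stale weights after an inflight
    update), not something an RL objective correction should absorb.
    """
    diff = (sampler_logprobs - trainer_logprobs).abs()
    return {
        "max_abs_diff": diff.max().item(),
        "mean_abs_diff": diff.mean().item(),
        # Fraction of tokens whose logprobs disagree beyond tolerance.
        "frac_mismatched": (diff > atol).float().mean().item(),
    }

# Example: fail loudly before training if the backend disagrees.
# (The tensors and tolerance below are illustrative values only.)
stats = check_logprob_agreement(
    sampler_logprobs=torch.tensor([-1.020, -0.510, -2.330]),
    trainer_logprobs=torch.tensor([-1.015, -0.505, -2.325]),
)
assert stats["frac_mismatched"] == 0.0, f"Backend/trainer logprob mismatch: {stats}"
```

A check along these lines separates genuine off-policy drift, which an importance-sampling correction can legitimately handle, from backend defects such as a mismatched final-projection dtype, which no objective-side correction should be asked to absorb.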