Red Teaming with RL: Exploiting Tinker API for Harmful RL on 235B Model
Blog post from HuggingFace
The article explores the emerging threat of Harmful Reinforcement Learning (Harmful RL), in which attackers repurpose the same reinforcement-learning techniques used to align Large Language Models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF), to instead cultivate harmful behaviors in those models. By manipulating reward functions and applying algorithms such as Group Relative Policy Optimization (GRPO), adversaries can misalign models without massive resources, because platforms like the Tinker API abstract away the complexity of distributed RL training.

The demonstration highlights how low the barrier to such attacks has become: by simply inverting a reward signal so that the optimizer favors unsafe behavior, an attacker can run Harmful RL against a model as large as the 235B-parameter Qwen3-235B at minimal cost. A sketch of this reward-inversion mechanism follows below.

The article calls for a proactive defense against this "Asymmetric Vulnerability," urging RLaaS (Reinforcement Learning as a Service) platforms and model providers to collaborate on robust countermeasures, as the democratization of powerful training tools continues to pose significant safety risks.
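To make the mechanism concrete, here is a minimal sketch of the reward inversion the article describes, paired with the group-relative advantage normalization that GRPO uses. All names (`safety_reward`, `inverted_reward`, `grpo_advantages`) are hypothetical illustrations, not functions from the article or the Tinker API; the advantage formula follows the standard published GRPO form, where each sampled completion's reward is normalized against its group's mean and standard deviation.

```python
import numpy as np

def safety_reward(completion: str) -> float:
    """Placeholder alignment scorer (assumption: any classifier that
    scores safer completions higher would play this role)."""
    return float("unsafe" not in completion.lower())

def inverted_reward(completion: str) -> float:
    # The attack described in the article: flip the sign of the
    # alignment reward, so RL optimizes *toward* unsafe behavior.
    return -safety_reward(completion)

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages in the style of GRPO: normalize each
    completion's reward against its sampling group's mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Usage: score one group of sampled completions for a single prompt.
group = [
    "Sure, here is a safe answer.",
    "UNSAFE content...",
    "Another safe reply.",
]
advantages = grpo_advantages([inverted_reward(c) for c in group])
print(advantages)  # the unsafe completion now receives the highest advantage
```

The key point the sketch illustrates is the asymmetry: nothing in the training loop changes between alignment and misalignment except the sign of the reward, which is why the article argues the defense has to live in the RLaaS platform rather than in the algorithm itself.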