How to run TorchForge reinforcement learning pipelines in the Together AI Native Cloud
Blog post from Together AI
The AI Native Cloud advances reinforcement learning (RL) systems with flexible, scalable infrastructure for modern RL pipelines, which require far more than a simple training loop. Built on the full PyTorch stack, including TorchForge and Monarch, it provides distributed training on Together Instant Clusters, which are optimized for low-latency GPU communication and consistent cluster setup.

These clusters accommodate heterogeneous RL workloads by efficiently managing both GPU and CPU resources, and they support complex RL frameworks that combine GPU-bound computation with CPU-bound tasks. Together AI also integrates tools such as CodeSandbox for microVM environments and Code Interpreter for isolated Python execution, enabling tool use, coding tasks, and simulations.

A demonstration shows a TorchForge RL pipeline running on these clusters, training a model to play Blackjack, and highlights how readily the system adapts to different models and tasks. This setup lays the groundwork for a flexible, open RL framework in the PyTorch ecosystem, with the goal of delivering high-performance RL services on the Together AI Cloud through ongoing collaboration and development in partnership with Meta.
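TorchForge's own training API is not reproduced here, but the Blackjack task in the demo can be illustrated independently. The sketch below is a minimal, standard-library tabular Q-learning agent on a simplified Blackjack environment (state = player sum, dealer upcard, usable ace; actions = stand/hit). The environment rules and all names are illustrative assumptions for this post, not TorchForge code.

```python
import random
from collections import defaultdict

# Simplified Blackjack, illustrative only (not TorchForge's environment).
# State: (player_sum, dealer_upcard, usable_ace). Actions: 0 = stand, 1 = hit.

def draw(rng):
    # Cards 1-13; face cards count as 10.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    # An ace (1) counts as 11 if that does not bust the hand.
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10, True) if usable else (total, False)

def play_episode(policy, rng):
    player = [draw(rng), draw(rng)]
    dealer = [draw(rng), draw(rng)]
    trajectory = []
    while True:
        total, usable = hand_value(player)
        if total > 21:
            return trajectory, -1.0          # player busts
        state = (total, dealer[0], usable)
        action = policy(state)
        trajectory.append((state, action))
        if action == 0:
            break
        player.append(draw(rng))
    while hand_value(dealer)[0] < 17:        # dealer hits to 17
        dealer.append(draw(rng))
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return trajectory, 1.0
    return trajectory, 0.0 if p == d else -1.0

def train(episodes=50_000, alpha=0.05, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)

    def policy(state):                       # epsilon-greedy behavior policy
        if rng.random() < eps:
            return rng.randint(0, 1)
        return max((0, 1), key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        trajectory, reward = play_episode(policy, rng)
        # Monte Carlo update toward the terminal episode reward.
        for state, action in trajectory:
            Q[(state, action)] += alpha * (reward - Q[(state, action)])
    return Q

def evaluate(policy, episodes=10_000, seed=1):
    rng = random.Random(seed)
    return sum(play_episode(policy, rng)[1] for _ in range(episodes)) / episodes

if __name__ == "__main__":
    Q = train()
    greedy = lambda s: max((0, 1), key=lambda a: Q[(s, a)])
    print(f"avg reward per hand (greedy policy): {evaluate(greedy):+.3f}")
```

In the actual pipeline, the policy would be a model trained and served across an Instant Cluster rather than a lookup table; the sketch only conveys the structure of the task the demo trains on.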