Company:
Date Published:
Author: Matthew Deng
Word count: 824
Language: English
Hacker News points: None

Summary

Ray Train V2 introduces several enhancements to distributed training, focusing on better usability, reliability, and a cleaner API surface that enables faster feature development. Key features include asynchronous checkpointing, which uploads model checkpoints in a separate CPU thread to keep GPU utilization high, and asynchronous validation, which runs validation in parallel without blocking the training loop. The release also introduces a JaxTrainer API for seamless scaling of JAX training on TPUs, using single-controller orchestration for greater fault tolerance than the traditional multi-controller setup. Additionally, a new local mode streamlines debugging by executing the training function directly in the current process, offering both single-process and multi-process options. These advancements lay the groundwork for future releases aimed at enhancing fault tolerance, framework integrations, and experiment management.
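The core idea behind asynchronous checkpointing can be sketched with standard-library Python: a background thread drains a queue of checkpoint payloads and persists them off the critical path, so the training loop never blocks on checkpoint I/O. This is a conceptual illustration only, not the Ray Train V2 API; the queue, worker function, and payloads are all hypothetical names.

```python
import queue
import threading
import time

uploaded = []  # stand-in for remote storage

def upload_worker(ckpt_queue: queue.Queue) -> None:
    """Drain checkpoints and persist them in the background."""
    while True:
        ckpt = ckpt_queue.get()
        if ckpt is None:  # sentinel: training finished
            break
        time.sleep(0.01)  # stand-in for a slow upload to remote storage
        uploaded.append(ckpt)

ckpt_queue: queue.Queue = queue.Queue()
worker = threading.Thread(target=upload_worker, args=(ckpt_queue,), daemon=True)
worker.start()

# Training loop: enqueue a checkpoint each step and keep computing,
# rather than waiting for the upload to finish.
for step in range(5):
    # ... forward/backward pass would go here ...
    ckpt_queue.put({"step": step, "weights": [0.0] * 4})

ckpt_queue.put(None)  # signal shutdown
worker.join()
print(len(uploaded))  # → 5
```

In Ray Train V2 the same pattern is applied to GPU training: the upload happens on a separate CPU thread, so the GPU stays busy with the next training step instead of stalling on checkpoint persistence.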