Company:
Date Published:
Author: Richard Liaw
Word count: 789
Language: English
Hacker News points: None

Summary

We’re excited to share recent developments for Ray in the 2.2 release: enhanced observability, improved performance for data-intensive AI applications, increased stability, and better UX for RLlib. The release addresses the Ray community's requirements for performance, robustness, and stability. The Ray Jobs API is now generally available (GA), allowing users to submit locally developed applications to a remote Ray cluster for execution. Improved observability features include the ability to visualize CPU flame graphs of worker processes and additional metrics in the Ray Dashboard. RLlib Algorithms now offer flexible fault tolerance for simulation/rollout workers and evaluation workers, and Ray Tune has improved checkpoint syncing and retry behavior.

The release also reduces latency and memory footprint for batch prediction, delivering nearly 50% higher throughput and a 100x smaller GPU memory footprint. Support for the ML data ecosystem has been expanded, including broader Apache Arrow version compatibility and full TFRecord read/write support in Ray Data. Out-of-memory errors have been addressed by enabling the Ray Out-Of-Memory (OOM) Monitor by default, and dynamic block splitting is now enabled by default to address performance issues with large files. Finally, UX improvements have been made to RLlib's command line interface and checkpoint format, making them more cohesive and transparent.
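
For context, here is a minimal sketch of submitting a locally developed script through the Ray Jobs Python SDK; the cluster address, script name, and working directory are placeholder assumptions, not details taken from the post:

```python
from ray.job_submission import JobSubmissionClient

# Connect to the remote cluster's Jobs endpoint (placeholder/default dashboard address).
client = JobSubmissionClient("http://127.0.0.1:8265")

# Submit a locally developed script; the working directory is shipped with the job.
job_id = client.submit_job(
    entrypoint="python my_script.py",   # hypothetical entrypoint script
    runtime_env={"working_dir": "./"},
)

# Poll the job's status from the client side.
print(client.get_job_status(job_id))
```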
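
Similarly, a short sketch of the TFRecord read/write support in Ray Data; the file paths below are illustrative assumptions:

```python
import ray

# Read TFRecord files into a Ray Dataset (input path is a placeholder).
ds = ray.data.read_tfrecords("s3://example-bucket/train-tfrecords/")

# ... transform the dataset as needed ...

# Write the dataset back out as TFRecord files (output path is a placeholder).
ds.write_tfrecords("/tmp/output-tfrecords")
```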