New: Joins & Hash-Shuffle in Ray Data

Company

Anyscale

Date Published

May 20, 2025

Author

Alexey Kudinkin, Praveen Gorthy and Richard Liaw

Word count

1054

Language

English

Hacker News points

None

URL

www.anyscale.com/blog/ray-data-joins-hash-shuffle

Summary

Ray Data has made significant improvements, including native join support via the `ds.join()` API, key-based repartitioning with `repartition(key=...)`, and a new custom aggregation API with `AggregateFnV2`. A hash-based shuffle backend powers joins, improving performance for repartitioning and aggregations, reducing memory pressure compared to previous implementations. The new backend works by hashing keys and partitioning data based on these hashes, allowing for more efficient shuffling and joining of records. Benchmarks show substantial improvements in runtime and reduced memory pressure for various workloads, including preprocessing, TPC-H Q1 SF100, and TPC-H Q1 SF1000. The hash-based shuffle backend also enables better performance for repartitioning and aggregations, reducing peak memory usage by up to 3.9x. Looking ahead, Ray Data plans to support different types of joins, logical plan optimizations, and further improvements for data preprocessors.