Home / Companies / Anyscale / Blog / Post Details
Content Deep Dive

New: Joins & Hash-Shuffle in Ray Data

Blog post from Anyscale

Post Details
Company
Date Published
Author
Alexey Kudinkin, Praveen Gorthy and Richard Liaw
Word Count
1,054
Language
English
Hacker News Points
-
Summary

Ray Data has made significant improvements, including native join support via the `ds.join()` API, key-based repartitioning with `repartition(key=...)`, and a new custom aggregation API with `AggregateFnV2`. A hash-based shuffle backend powers joins, improving performance for repartitioning and aggregations, reducing memory pressure compared to previous implementations. The new backend works by hashing keys and partitioning data based on these hashes, allowing for more efficient shuffling and joining of records. Benchmarks show substantial improvements in runtime and reduced memory pressure for various workloads, including preprocessing, TPC-H Q1 SF100, and TPC-H Q1 SF1000. The hash-based shuffle backend also enables better performance for repartitioning and aggregations, reducing peak memory usage by up to 3.9x. Looking ahead, Ray Data plans to support different types of joins, logical plan optimizations, and further improvements for data preprocessors.