Company
Date Published
Author
Alexey Kudinkin, Praveen Gorthy and Richard Liaw
Word count
1054
Language
English
Hacker News points
None

Summary

Ray Data has made significant improvements, including native join support via the `ds.join()` API, key-based repartitioning with `repartition(key=...)`, and a new custom aggregation API with `AggregateFnV2`. A hash-based shuffle backend powers joins, improving performance for repartitioning and aggregations, reducing memory pressure compared to previous implementations. The new backend works by hashing keys and partitioning data based on these hashes, allowing for more efficient shuffling and joining of records. Benchmarks show substantial improvements in runtime and reduced memory pressure for various workloads, including preprocessing, TPC-H Q1 SF100, and TPC-H Q1 SF1000. The hash-based shuffle backend also enables better performance for repartitioning and aggregations, reducing peak memory usage by up to 3.9x. Looking ahead, Ray Data plans to support different types of joins, logical plan optimizations, and further improvements for data preprocessors.