Aggregating Millions of Groups Fast in Apache Arrow DataFusion

Post Details

Company

InfluxData

Date Published

Aug. 1, 2023

Author

Andrew Lamb

Word Count

2,309

Language

English

Hacker News Points

-

Source URL

www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion

Summary

Apache Arrow DataFusion's new parallel aggregation capability improves performance by 2-3x for queries with a large number of groups, approaching DuckDB's speed when querying Parquet data. The improvement matters to developers building products and projects with DataFusion because it lets them spend more time on value-added, domain-specific features. The optimization reduces allocations, uses contiguous native accumulator states, and applies vectorized state updates, all of which speed up aggregation over high-cardinality groups. The work was a community effort and is part of the next generation of "Deconstructed Database" architectures built from fast, modular components.
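
To make the three ideas in the summary concrete (fewer allocations, contiguous accumulator state, and batch-at-a-time updates), here is a minimal Rust sketch of a grouped sum. It is an illustration only and is not DataFusion's code: the `GroupedSum` type, its fields, and its methods are hypothetical names, and it uses only the standard library with string keys and `i64` values as simplifying assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical grouped-sum accumulator (not a DataFusion type).
/// All per-group state lives in one contiguous Vec instead of one
/// heap-allocated accumulator object per group.
struct GroupedSum {
    group_index: HashMap<String, usize>, // group key -> slot in `sums`
    keys: Vec<String>,                   // slot -> group key, in first-seen order
    sums: Vec<i64>,                      // contiguous accumulator state
}

impl GroupedSum {
    fn new() -> Self {
        GroupedSum {
            group_index: HashMap::new(),
            keys: Vec::new(),
            sums: Vec::new(),
        }
    }

    /// Consume one batch of (key, value) rows.
    fn update_batch(&mut self, keys: &[&str], values: &[i64]) {
        assert_eq!(keys.len(), values.len());

        // Step 1: resolve every row's key to a slot number for the whole batch.
        let mut slots = Vec::with_capacity(keys.len());
        for key in keys {
            let slot = match self.group_index.get(*key) {
                Some(&existing) => existing,
                None => {
                    let fresh = self.keys.len();
                    self.group_index.insert((*key).to_string(), fresh);
                    self.keys.push((*key).to_string());
                    fresh
                }
            };
            slots.push(slot);
        }

        // Step 2: grow the state once per batch, not once per new group.
        self.sums.resize(self.keys.len(), 0);

        // Step 3: update state in a tight loop over plain integers.
        for (&slot, &value) in slots.iter().zip(values) {
            self.sums[slot] += value;
        }
    }

    /// Return (key, sum) pairs in first-seen group order.
    fn finish(&self) -> Vec<(String, i64)> {
        self.keys
            .iter()
            .cloned()
            .zip(self.sums.iter().copied())
            .collect()
    }
}
```

Calling `update_batch` with keys `["a", "b", "a"]` and values `[1, 2, 3]` leaves the sums for `a` and `b` at 4 and 2. Separating key resolution from the state update keeps the final loop free of hashing and branching, which is the property that makes this style of update friendly to compiler vectorization.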