Seeking the Perfect Apache Druid Rollup
Blog post from Rill
Apache Druid's rollup feature optimizes storage and improves query performance by collapsing multiple input rows into a single stored row wherever their dimension values are identical, pre-aggregating the metrics as it goes. Using rollup effectively requires understanding a handful of key concepts:

- Rollup combines rows whose dimension values match exactly, so the more general (lower-cardinality) your dimensions, the greater the compression.
- Timestamp granularity strongly affects rollup opportunities; Druid offers a range of granularity settings for truncating timestamps at ingestion.
- Not every aggregation can be pre-computed at ingestion. Averages, for example, cannot be rolled up directly, but a workaround exists.
- An ingestion-time count metric is essential for counting input records accurately, since after rollup a plain row count no longer reflects them.
- The Apache DataSketches project provides approximate aggregators that handle high-cardinality problems, such as distinct counts, without significantly increasing storage.
- Custom transformations on the __time dimension allow granularities beyond the built-in settings.
- Real-time ingestion, such as from Kafka, can only guarantee best-effort rollup, since rows with the same dimension values may land in different segments.

Druid's compaction feature further improves rollup efficiency after ingestion: it reduces the number of segments and performs additional rollup, making Druid a powerful tool for managing large-scale data efficiently.
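To sketch where the rollup and granularity settings live, here is the `granularitySpec` portion of a Druid ingestion spec. The field names are Druid's own; the specific granularities chosen here are illustrative:

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": true
  }
}
```

With `queryGranularity: "HOUR"`, all timestamps within an hour are truncated to the top of that hour at ingestion, so rows that also share every dimension value can be rolled into one. A coarser `queryGranularity` generally yields more rollup but less time precision.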
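For the average workaround, the usual pattern is to ingest a sum and a count as separate metrics and divide at query time. A sketch, assuming a hypothetical `latency_ms` input column:

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "latency_sum", "fieldName": "latency_ms" }
  ]
}
```

At query time, compute the average as `SUM(latency_sum) / SUM("count")` rather than applying `AVG` to the rolled-up rows; an average of per-row averages would weight each rolled-up row equally regardless of how many input records it represents.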
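For high-cardinality columns such as user IDs, storing the raw values would defeat rollup. Instead, a DataSketches aggregator (from the `druid-datasketches` extension) can build a fixed-size approximate sketch at ingestion. A minimal sketch of such a metric, assuming a hypothetical `user_id` input column:

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "HLLSketchBuild", "name": "user_id_hll", "fieldName": "user_id" }
  ]
}
```

The sketch column mergeable across rows and segments, so approximate distinct counts remain accurate at query time even after aggressive rollup and compaction.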
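When the built-in granularities are not a fit, a `transformSpec` expression can truncate `__time` to a custom period before rollup runs. A sketch using Druid's `timestamp_floor` expression with an illustrative ten-minute bucket (a period Druid's standard granularity settings do not offer):

```json
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "__time",
        "expression": "timestamp_floor(__time, 'PT10M')"
      }
    ]
  }
}
```

Because the transform rewrites the timestamp itself, rows within each ten-minute window become eligible for rollup just as if a built-in query granularity had been used.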