Native random sampling in ClickHouse
Blog post from ClickHouse
ClickHouse's native random sampling lets users run a query against a fraction of a table's data, trading a small loss of accuracy for faster execution. The post works through the UK house prices dataset, which holds over 30 million transactions. It creates a table whose sample key is the sipHash64 function applied to high-cardinality columns, such as the combined postcode columns, so that the sampled rows are evenly distributed. It then shows how to sample both by fraction and by row count, and how each reduces processing time and resource usage. For best results, the sampling key should come first in the ORDER BY clause, and sum aggregations should be scaled by the _sample_factor virtual column so that they estimate the full-table total rather than the total of the sampled rows. The technique is most effective for exploratory data analysis, where approximate answers are sufficient and a modest loss of accuracy is an acceptable price for speed.
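The reason sums need scaling while averages do not can be seen in a small simulation. The sketch below is plain Python, not ClickHouse: it is a simplified model that assumes sampling keeps the rows whose 64-bit hash falls in the lowest fraction of the hash range. The data is synthetic, BLAKE2b stands in for sipHash64 only because it ships with the standard library, and the names sample_key, sample, and sample_factor are illustrative (sample_factor plays the role of _sample_factor).

```python
import hashlib
import random
import string

UINT64_MAX = 2**64 - 1


def sample_key(postcode1, postcode2):
    # Stand-in for sipHash64(postcode1, postcode2): any uniform 64-bit hash
    # of a high-cardinality value serves for this illustration.
    data = f"{postcode1}|{postcode2}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sample(rows, fraction):
    # Keep rows whose hash lies in the lowest `fraction` of the 64-bit range.
    # The choice depends only on the key, so the same rows are selected on
    # every run (deterministic sampling).
    threshold = int(fraction * UINT64_MAX)
    return [row for row in rows if sample_key(row[0], row[1]) <= threshold]


def random_postcode(rng):
    # Synthetic postcode halves, e.g. ("AB12", "3CD"); not real data.
    part1 = "".join(rng.choices(string.ascii_uppercase, k=2)) + str(rng.randint(1, 99))
    part2 = str(rng.randint(0, 9)) + "".join(rng.choices(string.ascii_uppercase, k=2))
    return part1, part2


rng = random.Random(42)
rows = [(*random_postcode(rng), rng.randint(50_000, 1_000_000)) for _ in range(50_000)]

fraction = 0.1
sample_factor = 1 / fraction  # assumed scaling factor for a fractional sample
sampled = sample(rows, fraction)

exact_sum = sum(price for _, _, price in rows)
# A sum over the sample covers only ~10% of the rows, so scale it back up.
est_sum = sum(price * sample_factor for _, _, price in sampled)

exact_avg = exact_sum / len(rows)
# An average is a ratio, so the factor cancels out and no scaling is needed.
est_avg = sum(price for _, _, price in sampled) / len(sampled)

print(f"rows sampled: {len(sampled)} of {len(rows)}")
print(f"sum: exact={exact_sum}, estimated={est_sum:.0f}")
print(f"avg: exact={exact_avg:.0f}, estimated={est_avg:.0f}")
```

Because the hash of a postcode pair decides whether a row is kept, every row sharing that pair is kept or dropped together. This is why a high-cardinality key matters: with only a few distinct key values, the sample would be made of a few large clusters and the estimates would be far noisier.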