Scalable Select of Random Rows in SQL

Post Details

Company

Cube

Date Published

March 22, 2018

Author

Pavel Tiunov

Word Count

1,063

Language

English

Hacker News Points

-

Source URL

cube.dev/blog/select-random-rows-sql

Summary

The post compares the performance of traditional web analytics tools such as Google Analytics with that of data warehouses on large datasets, and shows how sampling can narrow the gap. It covers the basics of sampling, which means selecting a subset of the data to estimate properties of the whole population, and stresses that sampling should be keyed on unique user identifiers to keep results accurate. It outlines two primary methods for selecting random rows in SQL, simple random sampling and systematic sampling, and favors the latter because it is simpler to implement in SQL. When user identifiers are generated from a sequence, systematic sampling can be done with the MOD operation; for string or other non-sequential identifiers, a hash function such as FARM_FINGERPRINT in BigQuery is recommended so that the values are uniformly distributed. The post also warns about pitfalls such as sampling bias, especially when rare events are involved, and recommends more sophisticated techniques when precision would otherwise suffer.
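The two systematic-sampling variants summarized above can be sketched outside the database. The following is a minimal Python illustration, not the post's own code: the function names, the 1-in-10 sampling rate, and the synthetic identifiers are all assumptions, and SHA-256 from the standard library stands in for a warehouse hash function such as FARM_FINGERPRINT.

```python
import hashlib


def sample_sequential(user_ids, step=10):
    """Keep users whose numeric ID is divisible by `step`.

    Rough analogue of a SQL filter like: WHERE MOD(user_id, 10) = 0
    (assumes IDs come from a sequence, so they are evenly spread).
    """
    return [uid for uid in user_ids if uid % step == 0]


def stable_bucket(user_id, buckets=10):
    """Map a string ID to a bucket in [0, buckets) using a stable hash.

    SHA-256 is only a stand-in here; a warehouse would use its own
    hash function. Python's built-in hash() is avoided because it is
    randomized per process for strings.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def sample_hashed(user_ids, buckets=10):
    """Keep users whose hash lands in bucket 0 (about 1 in `buckets`)."""
    return [uid for uid in user_ids if stable_bucket(uid, buckets) == 0]


# Hypothetical data: 1,000 sequential IDs and 10,000 string IDs.
sequential_ids = list(range(1, 1001))
string_ids = [f"user-{i}" for i in range(10_000)]

seq_sample = sample_sequential(sequential_ids, step=10)
hash_sample = sample_hashed(string_ids, buckets=10)

# Scale a count from the sample back up to a population estimate.
estimated_users = len(hash_sample) * 10
```

Because the filter depends only on the user identifier, every event belonging to a sampled user stays in the sample, and a count taken on the sample is multiplied by the sampling step to estimate the full population. The estimate is approximate, and it becomes less reliable for rare events.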