SkipScan in TimescaleDB: Why DISTINCT Was Slow, How We Built It, and How You Can Use It
Blog post from Tiger Data
SkipScan is an optimization technique developed for TimescaleDB to address the inefficiencies of the DISTINCT operation in PostgreSQL, particularly in large-scale databases. Traditionally, executing queries with DISTINCT involves scanning and deduplicating every qualifying row, which can be time-consuming when dealing with millions of entries. SkipScan leverages the structure of B-trees to jump directly from one distinct value to the next, significantly reducing the computational cost from O(N) to O(K × log N), where K is the number of distinct values. This approach is especially beneficial when K is much smaller than N, allowing for millisecond-level response times. Initially introduced in TimescaleDB 2.2.0 for rowstores, SkipScan was later extended to columnstore hypertables and to handle distinct aggregates like COUNT(DISTINCT …) with the release of TimescaleDB 2.20.0. The technique further evolved with the addition of multi-column SkipScan in version 2.22.0, provided that all column values are not-null. By strategically designing schema layouts and indexes, SkipScan enables efficient query execution without necessitating changes to existing applications, thus ensuring faster deduplication and streamlined query performance across vast datasets.