Home / Companies / Tiger Data / Blog / Post Details
Content Deep Dive

SkipScan in TimescaleDB: Why DISTINCT Was Slow, How We Built It, and How You Can Use It

Blog post from Tiger Data

Post Details
Company
Date Published
Author
Natalya Aksman
Word Count
2,569
Language
English
Hacker News Points
-
Summary

SkipScan is an optimization technique developed for TimescaleDB to address the inefficiencies of the DISTINCT operation in PostgreSQL, particularly in large-scale databases. Traditionally, executing queries with DISTINCT involves scanning and deduplicating every qualifying row, which can be time-consuming when dealing with millions of entries. SkipScan leverages the structure of B-trees to jump directly from one distinct value to the next, significantly reducing the computational cost from O(N) to O(K × log N), where K is the number of distinct values. This approach is especially beneficial when K is much smaller than N, allowing for millisecond-level response times. Initially introduced in TimescaleDB 2.2.0 for rowstores, SkipScan was later extended to columnstore hypertables and to handle distinct aggregates like COUNT(DISTINCT …) with the release of TimescaleDB 2.20.0. The technique further evolved with the addition of multi-column SkipScan in version 2.22.0, provided that all column values are not-null. By strategically designing schema layouts and indexes, SkipScan enables efficient query execution without necessitating changes to existing applications, thus ensuring faster deduplication and streamlined query performance across vast datasets.