How we created count distinct in Statsig Cloud

Blog post from Statsig

Post Details
Company: Statsig
Date Published:
Author: Aamodit Acharya
Word Count: 1,305
Language: English
Hacker News Points: -
Summary

Statsig added a Count Distinct metric to its cloud platform in response to customer requests to measure unique interactions, such as distinct artists listened to or unique brands purchased over time. Users define the metric by specifying the event and field to count, plus optional dimensions, and the same definition applies consistently across tools such as Experiments and Pulse. The metric uses sketches, a family of probabilistic data structures, to compute distinct counts over multiple days efficiently while staying fast and accurate as data volumes grow. The pipeline was first developed in BigQuery and then moved to Spark for better integration with downstream processes; custom wrappers and UDFs keep results consistent between the two platforms. The result is a fast, scalable way to analyze exploration and variety in user interactions that fits into existing workflows.
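
To make the idea of sketch-based distinct counting over multiple days concrete, here is a minimal Python sketch. The summary does not say which sketch algorithm Statsig uses, so this example swaps in a simple K-Minimum-Values (KMV) sketch purely for illustration; the class name `KMVSketch`, the parameter `k`, and the `artist-N` identifiers are all hypothetical and not taken from the post.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch (an assumption, not Statsig's implementation).

    Keeps only the k smallest normalized hash values seen so far.
    """

    def __init__(self, k=1024):
        self.k = k
        self.values = set()  # at most k hash values, each in [0, 1)

    @staticmethod
    def _hash(item):
        # Map an item to a deterministic pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            # Evict the largest value so only the k smallest remain.
            self.values.remove(max(self.values))

    def merge(self, other):
        # The k smallest values of the union equal the sketch of the combined stream.
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        if len(self.values) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self.values))
        # Standard KMV estimator: (k - 1) / (k-th smallest normalized hash).
        return (self.k - 1) / max(self.values)


# One sketch per day (hypothetical data): 3,000 artists on day 1,
# 4,000 on day 2, with 1,000 artists appearing on both days.
day1 = KMVSketch()
for artist_id in range(0, 3000):
    day1.add(f"artist-{artist_id}")

day2 = KMVSketch()
for artist_id in range(2000, 6000):
    day2.add(f"artist-{artist_id}")

# Merging daily sketches approximates the distinct count over both days
# (true value: 6,000) without rescanning the raw events.
two_day_estimate = day1.merge(day2).estimate()
```

The property that matters here is mergeability: each day's sketch has a bounded size, and combining them yields an estimate for the whole window, which is one plausible reason a sketch-based approach stays fast as data volumes grow. Production systems typically rely on a library or built-in sketch functions rather than a hand-rolled structure like this one.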