
How Statsig’s data platform processes hundreds of petabytes daily

Blog post from Statsig

Post Details
Company: Statsig
Author: Pushpendra Nagtode
Word Count: 1,953
Language: English
Summary

Statsig's experimentation and analytics platform processes over 100 petabytes of data daily, handling trillions of events for more than 2,000 companies. Its architecture is built around a hybrid model: BigQuery for analytics and Spark with Iceberg for large-scale data processing. The move from Databricks to BigQuery was driven by the need for better scalability and cost management, and it led to custom solutions such as the Statsig Builder Tool for flexible, multi-language workflows and a tailored orchestration system for diverse data sources. Key strategies include optimizing BigQuery usage, enforcing a robust data quality framework, using Iceberg Storage Partition Joins to cut shuffle overhead in large joins, and running Spark on spot nodes for cost-effective performance. Together these choices keep the platform scalable, efficient, and reliable while the team continues to refine it for performance and cost.
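The Storage Partition Join technique mentioned in the summary can be illustrated with Spark SQL session settings. This is a minimal sketch, not Statsig's actual configuration: it assumes a recent Spark (3.3+ for the base feature) reading Iceberg tables that share an identical partition spec on the join keys, and the table and column names below are hypothetical.

```sql
-- Enable storage-partitioned joins (Spark 3.3+ with a V2 data source
-- such as Iceberg). When both sides of a join are partitioned the same
-- way on the join keys, Spark joins matching partitions directly
-- instead of shuffling both tables across the cluster.
SET spark.sql.sources.v2.bucketing.enabled = true;

-- (Spark 3.4+) Push partition values so that partitions missing on one
-- side do not force a fallback to a shuffle-based join.
SET spark.sql.sources.v2.bucketing.pushPartValues.enabled = true;

-- Hypothetical tables, both partitioned identically on the join keys
-- (e.g. days(event_ts) and bucket(128, company_id)), so this join can
-- proceed partition-by-partition with no shuffle stage.
SELECT e.company_id, count(*) AS events
FROM warehouse.events e
JOIN warehouse.exposures x
  ON e.company_id = x.company_id AND e.event_ts = x.event_ts
GROUP BY e.company_id;
```

If the two tables' partition specs differ, Spark silently falls back to a regular sort-merge join with a full shuffle, so the saving depends on keeping partition layouts aligned across the tables being joined.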