
How Statsig’s data platform processes hundreds of petabytes daily

Blog post from Statsig

Post Details
Company: Statsig
Author: Pushpendra Nagtode
Word Count: 1,953
Language: English
Summary

Statsig's experimentation and analytics platform processes over 100 petabytes of data daily, handling trillions of events for more than 2,000 companies. Its architecture is built around a hybrid model: BigQuery for analytics and Spark with Iceberg for large-scale data processing. The move from Databricks to BigQuery was driven by the need for better scalability and cost management, and it led to custom solutions such as the Statsig Builder Tool for flexible, multi-language workflows and a tailored orchestration system for diverse data sources. Key strategies include optimizing BigQuery usage, enforcing a robust data quality framework, using Iceberg Storage Partition Joins to cut shuffle overhead in large joins, and running Spark on spot nodes for cost-effective performance. Together these choices keep the platform scalable, efficient, and reliable while the team continues to refine it for performance and cost.
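The Storage Partition Join technique mentioned in the summary can be illustrated with Spark SQL session settings. This is a minimal sketch, not Statsig's actual configuration: it assumes a recent Spark (3.3+ for the base feature) reading Iceberg tables that share an identical partition spec on the join keys, and the table and column names below are hypothetical.

```sql
-- Enable storage-partitioned joins (Spark 3.3+ with a V2 data source
-- such as Iceberg). When both sides of a join are partitioned the same
-- way on the join keys, Spark joins matching partitions directly
-- instead of shuffling both tables across the cluster.
SET spark.sql.sources.v2.bucketing.enabled = true;

-- (Spark 3.4+) Push partition values so that partitions missing on one
-- side do not force a fallback to a shuffle-based join.
SET spark.sql.sources.v2.bucketing.pushPartValues.enabled = true;

-- Hypothetical tables, both partitioned identically on the join keys
-- (e.g. days(event_ts) and bucket(128, company_id)), so this join can
-- proceed partition-by-partition with no shuffle stage.
SELECT e.company_id, count(*) AS events
FROM warehouse.events e
JOIN warehouse.exposures x
  ON e.company_id = x.company_id AND e.event_ts = x.event_ts
GROUP BY e.company_id;
```

If the two tables' partition specs differ, Spark silently falls back to a regular sort-merge join with a full shuffle, so the saving depends on keeping partition layouts aligned across the tables being joined.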