Automating BigQuery load jobs from GCS: Our scalable approach
Blog post from Statsig
Statsig built a flexible, dynamic ingestion system to load data from Google Cloud Storage (GCS) into BigQuery, replacing an initial setup that had proved too rigid. The new system, written in Python and run by an orchestrator, discovers buckets automatically, groups incoming files into time-based buckets, and tracks job status reliably using MongoDB and BigQuery's INFORMATION_SCHEMA. It is declarative: it compares the desired state with the actual state and executes only the load jobs needed to reconcile the two, which keeps the loaded data consistent and accurate. As a result, new data sources can be onboarded quickly without manual intervention, and resources are not spent on unnecessary operations. The system processes over a trillion rows daily, which demonstrates its scalability, reliability, and efficiency on large datasets.
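The declarative comparison described above can be illustrated with a minimal Python sketch. It is not Statsig's implementation. The names `LoadJob`, `window_for`, `desired_jobs`, and `plan`, the 15-minute window, and the example URIs are all hypothetical, and an in-memory set stands in for the job status that the post says is tracked in MongoDB and BigQuery's INFORMATION_SCHEMA.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LoadJob:
    """Hypothetical key for one load job: a data source plus a time window."""
    source: str
    window_start: datetime


def window_for(ts, minutes=15):
    """Floor a timestamp to the start of its time window.

    Assumes `minutes` divides 60 evenly; the 15-minute default is illustrative.
    """
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def desired_jobs(files, minutes=15):
    """Desired state: every discovered file grouped by (source, time window).

    `files` is an iterable of (source, uri, created_at) tuples, standing in
    for whatever a real bucket listing would return.
    """
    grouped = defaultdict(list)
    for source, uri, created_at in files:
        grouped[LoadJob(source, window_for(created_at, minutes))].append(uri)
    return {job: sorted(uris) for job, uris in grouped.items()}


def plan(desired, completed):
    """Diff desired state against actual state; return only the missing jobs.

    `completed` is a set of LoadJob keys already loaded successfully.
    """
    return {job: uris for job, uris in desired.items() if job not in completed}


# Example: three files for one source, one for another.
utc = timezone.utc
files = [
    ("events", "gs://example-bucket/events/a.parquet", datetime(2024, 1, 1, 10, 3, tzinfo=utc)),
    ("events", "gs://example-bucket/events/b.parquet", datetime(2024, 1, 1, 10, 14, 59, tzinfo=utc)),
    ("events", "gs://example-bucket/events/c.parquet", datetime(2024, 1, 1, 10, 15, tzinfo=utc)),
    ("exposures", "gs://example-bucket/exposures/d.parquet", datetime(2024, 1, 1, 10, 7, tzinfo=utc)),
]
already_loaded = {LoadJob("events", datetime(2024, 1, 1, 10, 0, tzinfo=utc))}
to_run = plan(desired_jobs(files), already_loaded)
for job, uris in sorted(to_run.items(), key=lambda kv: (kv[0].source, kv[0].window_start)):
    print(job.source, job.window_start.isoformat(), uris)
```

Because `plan` is a pure function of the two states, running it again after every job has succeeded returns an empty plan. That is how a design of this kind avoids unnecessary operations, and a new source needs no extra configuration: it simply appears in the desired state once its files are discovered.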