Iterating terabyte-sized ClickHouse® tables in production
Blog post from Tinybird
Tinybird's evolution from batch to real-time ingestion forced a deliberate approach to schema migrations, particularly for streaming data. Initially, Tinybird let users ingest CSV files into ClickHouse® clusters, but as customer needs shifted toward real-time processing, the company added streaming connectors such as Kafka and Kinesis. Migrating the schema of a table that is continuously receiving data is far harder than migrating a static one: the change must not interrupt the data flow or break the user-facing products built on top of it.

To manage this, Tinybird built a git integration that brings version control, CI/CD, and automated testing to schema changes. Migrations can be tested in a production-like environment before deployment, and any mistake can be rolled back cleanly. Combined with Materialized Views, this let Tinybird execute complex data operations without compromising the stability of its services, and it significantly reduced the time customers spent debugging.

The broader lesson is that software development best practices apply directly to data engineering: with version control, automated tests, and reversible deployments, schema migrations in high-throughput streaming systems become manageable and reliable.
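The Materialized View technique mentioned above can be sketched in ClickHouse SQL. The post does not show Tinybird's actual implementation, so the table and column names below (`events`, `events_new`, an added `country` column) are illustrative assumptions; the general pattern is: create a target table with the new schema, attach a Materialized View so newly streamed rows land in both tables, backfill historical data, then swap the tables atomically.

```sql
-- Hypothetical existing table: `events`, continuously receiving
-- streaming inserts. Goal: add a `country` column without
-- stopping ingestion.

-- 1. Create a copy of the table with the new schema.
CREATE TABLE events_new
(
    timestamp DateTime,
    user_id   UInt64,
    country   LowCardinality(String) DEFAULT 'unknown'
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp);

-- 2. Materialized View: every row inserted into `events` from
--    now on is also written to `events_new`, so the stream keeps
--    flowing while the migration runs.
CREATE MATERIALIZED VIEW events_forward TO events_new AS
SELECT timestamp, user_id, 'unknown' AS country
FROM events;

-- 3. Backfill historical rows, ideally in bounded chunks (e.g.
--    one partition at a time) to limit cluster load. The cutoff
--    must match the moment the view was attached, so no row is
--    duplicated or missed.
INSERT INTO events_new
SELECT timestamp, user_id, 'unknown' AS country
FROM events
WHERE timestamp < '2024-01-01 00:00:00';  -- illustrative cutoff

-- 4. Atomically swap the tables (requires the Atomic database
--    engine), then remove the forwarding view.
EXCHANGE TABLES events AND events_new;
DROP VIEW events_forward;
```

This is only a sketch of the swap-and-backfill pattern; a production migration on terabyte-sized tables would also need to throttle the backfill, verify row counts between the old and new tables, and keep the old table around until the rollback window has passed.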