Icehouse experimentation and migration
Blog post from Starburst
Data virtualization helps manage data while experimenting with new systems and during migrations, because it provides a unified interface to data spread across multiple systems, including Icehouses and legacy data warehouses.

The two main data virtualization architectures are pull-based and push-based. Pull-based systems retrieve data from the underlying systems and process it locally. Push-based systems delegate as much processing as possible to the underlying systems, which reduces the amount of data transferred. Although push-based systems are more efficient at minimizing data transfer, pull-based systems are more common today because they are easier to implement and work well over modern, high-throughput networks.

When data from multiple systems needs to be accessed, a data virtualization system reduces the complexity of data management and integration. Performance still needs attention: pull-based systems can hit bottlenecks in data extraction, while push-based systems can run into compatibility issues. For Icehouses, which store data as Iceberg tables on distributed storage, pull-based virtualization is the norm, because Iceberg has no built-in query processing capability that a push-based system could delegate work to. Choosing an architecture for a complex data environment such as an Icehouse therefore means weighing these trade-offs and their performance implications.
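
The difference between pull-based and push-based processing can be illustrated with a small sketch. It is not taken from the blog post: the two in-memory SQLite databases stand in for arbitrary underlying systems that can execute SQL, and the `orders` table, the function names, and the "rows transferred" counter are all assumptions made for illustration.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one underlying system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def pull_based_total(sources, region):
    """Pull every row out of each source, then filter and sum locally."""
    transferred = 0
    total = 0.0
    for conn in sources:
        rows = conn.execute("SELECT region, amount FROM orders").fetchall()
        transferred += len(rows)  # every row crosses the "network"
        total += sum(amount for r, amount in rows if r == region)
    return total, transferred


def push_based_total(sources, region):
    """Ask each source to filter and aggregate, then combine the partial sums."""
    transferred = 0
    total = 0.0
    for conn in sources:
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
            (region,),
        ).fetchone()
        transferred += 1  # only one aggregated row comes back per source
        total += partial
    return total, transferred


# Two hypothetical underlying systems holding parts of the same logical table.
system_a = make_source([("EU", 10.0), ("US", 5.0), ("EU", 2.5)])
system_b = make_source([("EU", 4.0), ("APAC", 7.0)])
sources = [system_a, system_b]

pull_total, pull_rows = pull_based_total(sources, "EU")
push_total, push_rows = push_based_total(sources, "EU")
print("pull-based:", pull_total, "rows transferred:", pull_rows)
print("push-based:", push_total, "rows transferred:", push_rows)
```

Both functions return the same total, but the pull-based version moves every row to the virtualization layer while the push-based version moves one aggregated row per source. The push-based version only works because each stand-in system can run the query itself; a store with no query processing of its own, as described for Iceberg tables, leaves only the pull-based path.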