Company
Bodo
Date Published
Author
Ritwika Ghosh and Alireza Farhidzadeh
Word count
1302
Language
English
Hacker News points
None

Summary

Snowflake Data Cloud simplifies data management for data engineers at near-unlimited scale, while the Bodo Platform brings extreme performance and scalability to large-scale Python data processing. Snowflake and Bodo have combined forces to let data teams complete very large ETL and data-prep jobs in a fraction of the time, and with better efficiency, than they could with Snowflake alone. Bodo achieves this through a supercomputing-style, MPI-based parallel approach to data analytics in native Python, which is an order of magnitude faster and far more resource-efficient than Apache Spark. Combining these best-in-class storage and compute solutions requires very efficient data movement between the two platforms, which has been a key focus of Bodo's partnership with the Snowflake team.

Bodo's Snowflake Ready connector lets Bodo clusters read terabytes of Snowflake data extremely fast; it is built in and fully automatic, and it delivers performance similar to reading Parquet datasets on S3 (a usage sketch appears below). Reading large amounts of data from Snowflake with Bodo is demonstrated to be faster than reading the same data from Parquet files on AWS S3: the Snowflake read of 1 TB took around 154 seconds, just under the roughly three minutes Bodo needed for the Parquet read.

The Bodo JIT compiler optimizes queries by automatically selecting only the necessary columns and pushing filters down to Snowflake, which simplifies data pipelines (see the second sketch below). In addition, Bodo's distributed fetch mechanism reads the data in parallel chunks, and each core then executes the application on its own chunk, letting developers build highly performant data pipelines that combine fast distributed fetch with parallelized computation (see the final sketch below).
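A minimal sketch of what reading Snowflake data with Bodo can look like, based on the documented bodo.jit plus pandas.read_sql pattern; the credentials, warehouse name, and the TPC-H sample table are placeholders:

```python
import bodo
import pandas as pd

@bodo.jit
def count_lineitem():
    # Placeholder connection string: user, password, account, and
    # warehouse are assumptions to be replaced with real values.
    conn = "snowflake://user:password@account/SNOWFLAKE_SAMPLE_DATA/TPCH_SF1?warehouse=MY_WH"
    # Bodo compiles this into a fast, distributed Snowflake read.
    df = pd.read_sql("SELECT * FROM LINEITEM", conn)
    return len(df)

print(count_lineitem())
```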
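The column-pruning and filter-pushdown behavior could look like the following sketch: the query text says SELECT *, but the compiler sees which columns and rows the function actually uses (column names follow the TPC-H sample schema; the exact query rewriting is internal to Bodo):

```python
import bodo
import pandas as pd

@bodo.jit
def recent_revenue(conn):
    # Only three columns are used below and rows are filtered on
    # l_shipdate, so Bodo can rewrite the Snowflake query to fetch just
    # those columns with the date filter pushed down.
    df = pd.read_sql("SELECT * FROM LINEITEM", conn)
    df = df[df["l_shipdate"] >= pd.Timestamp("1998-01-01")]
    return (df["l_extendedprice"] * (1.0 - df["l_discount"])).sum()
```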
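Because the fetch is distributed over MPI processes, the same script scales by adding processes; a small sketch using Bodo's rank utilities (the mpiexec launch line is an assumption based on standard MPI usage):

```python
import bodo

@bodo.jit
def show_layout():
    # Each MPI process (rank) fetches and computes on only its own chunk
    # of the data; get_rank()/get_size() expose that parallel layout.
    print("rank", bodo.get_rank(), "of", bodo.get_size())

# Run with, e.g.: mpiexec -n 8 python snowflake_read.py
show_layout()
```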