How to build data transformations with Python, Ibis, and Starburst Galaxy
Blog post from Starburst
Starburst Galaxy and Ibis work well together for building data-intensive applications. Galaxy connects cloud data sources and runs the processing and analysis on its optimized Trino clusters, while Ibis presents the data to end users through a pandas-like Python API. This gives data scientists an efficient workflow for complex data manipulations and computations across the various analytical backends that Ibis supports. The setup involves creating a Starburst Galaxy account, connecting a data lake catalog, and using schema discovery to register datasets as tables that can be queried; schema discovery also helps automate the detection of new files in the data lake. The post demonstrates the integration with a tutorial based on NYC Taxi trip data, showing how to prepare, upload, and analyze a dataset using Trino's fast distributed SQL query engine and Ibis's user-friendly interface.
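To make the workflow concrete, here is a minimal sketch of how an Ibis session against a Galaxy cluster might look. It is not taken from the tutorial itself: the environment variable names (`GALAXY_USER` and so on), the table name `trips`, and the helper functions are hypothetical, and the columns `passenger_count` and `fare_amount` are assumed to exist in the registered NYC Taxi table. It also assumes the Trino extra of Ibis is installed (`pip install 'ibis-framework[trino]'`). Keyword arguments to `ibis.trino.connect` can differ between Ibis releases (some accept `auth` in place of `password`), so check the version you have installed.

```python
import os


def galaxy_connection_kwargs(env=os.environ):
    """Collect Trino connection settings from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    return {
        "user": env.get("GALAXY_USER", ""),
        "password": env.get("GALAXY_PASSWORD"),
        "host": env.get("GALAXY_HOST", "localhost"),
        # Assumption: the cluster is reached over HTTPS on port 443.
        "port": int(env.get("GALAXY_PORT", "443")),
        "database": env.get("GALAXY_CATALOG"),  # Trino catalog
        "schema": env.get("GALAXY_SCHEMA"),
    }


def connect_to_galaxy(env=os.environ):
    """Open an Ibis connection to the Trino cluster (needs network access)."""
    import ibis  # imported lazily so the helpers above work without Ibis

    return ibis.trino.connect(**galaxy_connection_kwargs(env))


def average_fare_by_passenger_count(con, table_name="trips"):
    """Build a lazy Ibis expression; nothing runs until it is executed.

    Assumes the table has `passenger_count` and `fare_amount` columns.
    """
    t = con.table(table_name)
    return (
        t.group_by("passenger_count")
        .aggregate(avg_fare=t.fare_amount.mean(), trip_count=t.count())
        .order_by("passenger_count")
    )
```

With credentials in place, `average_fare_by_passenger_count(connect_to_galaxy()).to_pandas()` would compile the expression to Trino SQL, run it on the cluster, and return the result as a pandas DataFrame. Keeping the expression lazy means the heavy lifting stays in Trino and only the aggregated rows travel back to Python.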