PyStarburst: the DataFrame API
Blog post from Starburst
Starburst Galaxy has introduced Python DataFrames, currently in public preview, which let users manipulate data in Starburst through the DataFrame API commonly associated with Spark. The blog post by Lester Martin walks through setting up the environment, including prerequisites such as Python and pip, and then through using PyStarburst, a Python library for interacting with Starburst Galaxy. Its step-by-step examples create and manipulate DataFrames over the TPCH dataset: selecting a table, filtering, joining, and sorting. The post shows how easily these methods chain together and points to the sql() method as an alternative for running SQL queries directly. It presents PyStarburst as a flexible option for data engineers who prefer programming over SQL, and notes that complex SQL queries can be simplified with the DataFrame API, whose operations are ultimately translated into efficient SQL queries executed by the Trino engine on Starburst Galaxy.
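To make the idea of chained operations becoming one SQL query more concrete, here is a toy sketch in plain Python. It is not PyStarburst and does not reproduce its internals: the LazyFrame class and its to_sql() method are hypothetical names invented for this illustration, and predicates are passed as raw strings purely to keep the example short. The table and column names assume the customer table from the TPCH sample data.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: it records operations, runs nothing."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    order_by: Tuple[str, ...] = ()

    def select(self, *cols: str) -> "LazyFrame":
        # Return a new frame with the projection replaced.
        return replace(self, columns=cols)

    def filter(self, predicate: str) -> "LazyFrame":
        # Predicates accumulate and are later joined with AND.
        return replace(self, predicates=self.predicates + (predicate,))

    def sort(self, column: str, ascending: bool = True) -> "LazyFrame":
        direction = "ASC" if ascending else "DESC"
        return replace(self, order_by=self.order_by + (f"{column} {direction}",))

    def to_sql(self) -> str:
        # Render everything recorded so far as a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.order_by:
            sql += " ORDER BY " + ", ".join(self.order_by)
        return sql


# Each call returns a new frame, so the steps chain naturally.
query = (
    LazyFrame("tpch.tiny.customer")
    .select("name", "acctbal", "nationkey")
    .filter("acctbal > 9900.0")
    .sort("acctbal", ascending=False)
)
print(query.to_sql())
```

Nothing is executed until the SQL is rendered, which is the general pattern that lets a DataFrame-style library hand one complete query to an engine such as Trino instead of running each step separately. A real library builds a much richer query plan and handles joins, expressions, and quoting, all of which this sketch leaves out.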