
Building Starburst Data Pipelines with SQL or Python

Blog post from Starburst

Post Details
Company: Starburst
Date Published
Author: Lester Martin
Word Count: 1,721
Language: English
Hacker News Points: -
Summary

Starburst, known primarily as a SQL query engine, also serves as a versatile platform for ETL workloads, accommodating both SQL-oriented and Python-based data pipelines. Built on Trino, which was originally developed for interactive querying, it has been widely adopted for ETL tasks, replacing Hive in many contexts thanks to its faster query execution. Starburst distributes query processing as a directed acyclic graph (DAG) and gains speed by streaming intermediate results between stages rather than persisting them to disk, much like a streaming engine. Early limitations around long-running, memory-intensive queries have been addressed by fault-tolerant execution (FTE), which enables more reliable, stage-by-stage processing. Data engineers can build pipelines in SQL or Python, with PyStarburst and Ibis providing DataFrame APIs that translate Python code into queries executed on Trino clusters. The platform also integrates with orchestration tools such as Airflow and Dagster, making it a robust choice for transformation processing jobs.
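The DAG-based, stage-by-stage execution the summary describes can be illustrated with a minimal sketch using Python's standard-library graphlib. This is a toy model of the concept only, not Starburst's actual scheduler; the stage names and run order logic are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it
# depends on. A DAG scheduler may only run a stage once all of its
# upstream dependencies have completed.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "publish": {"aggregate"},
}

def run_pipeline(dag):
    """Execute stages in dependency order and return the run order."""
    order = []
    for stage in TopologicalSorter(dag).static_order():
        # A real engine would stream intermediate data between stages
        # here (or, under FTE, checkpoint stage output for retries).
        order.append(stage)
    return order

order = run_pipeline(pipeline)
```

Under fault-tolerant execution, failure of one stage means only that stage (and its downstream dependents) must be retried, rather than the whole query — which is what makes long-running ETL jobs practical on this model.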