
Data pipelines and data lakehouses

Blog post from Starburst

Post Details

Company: Starburst
Date Published:
Author: Evan Smith
Word Count: 2,122
Language: English
Hacker News Points: -
Summary

Data lakehouses evolved from traditional data lakes, gaining enhanced capabilities by integrating modern table formats like Apache Iceberg, Delta Lake, and Hudi, which enable advanced querying and data processing. The architecture typically centers on a three-part data pipeline made up of the Land, Structure, and Consume layers: data arrives in its raw state in the Land layer, is transformed and validated in the Structure layer, and finally becomes consumable in the Consume layer for use in business intelligence tools and data products. Ingestion can be batch or streaming, with technologies like Kafka, Flink, and Apache Spark handling these operations. SQL is central to constructing these layers, performing normalization, validation, enrichment, and technical transformation, with tools such as Starburst Galaxy and Starburst Enterprise providing integration and support throughout the pipeline. Data is consumed primarily through queries, business intelligence tools, and curated data products, enabling greater data visibility, discovery, and governance.
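To make the layered flow concrete, here is a minimal sketch of streaming ingestion into the Land layer using Flink SQL's Kafka connector; the topic name, broker address, and column schema are hypothetical placeholders, not details from the post.

```sql
-- Land layer (sketch): expose a Kafka topic as a streaming table in Flink SQL.
-- Topic, broker, and schema are illustrative placeholders.
CREATE TABLE orders_raw (
    order_id       STRING,
    customer_email STRING,
    total_amount   DOUBLE,
    order_ts       TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'broker:9092',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json'
);
```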
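The Structure layer's SQL-driven normalization and validation could then look like the following Trino-style sketch, assuming a hypothetical Iceberg-backed catalog named lakehouse with land and structure schemas.

```sql
-- Structure layer (sketch): normalize and validate raw records into an
-- Iceberg table. Catalog, schema, and column names are illustrative.
CREATE TABLE lakehouse.structure.orders
WITH (format = 'PARQUET')
AS
SELECT
    CAST(order_id AS BIGINT)             AS order_id,
    lower(trim(customer_email))          AS customer_email,  -- normalization
    CAST(total_amount AS DECIMAL(12, 2)) AS total_amount,
    CAST(order_ts AS TIMESTAMP(6))       AS order_ts
FROM lakehouse.land.orders_raw
WHERE order_id IS NOT NULL               -- validation: drop malformed rows
  AND total_amount >= 0;
```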
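Finally, a Consume-layer data product can be as simple as a curated view over the structured table, ready for queries and BI tools; the names here are again illustrative.

```sql
-- Consume layer (sketch): a curated view that BI tools and ad hoc queries
-- can read directly as a data product.
CREATE VIEW lakehouse.consume.daily_revenue AS
SELECT
    date_trunc('day', order_ts) AS order_date,
    count(*)                    AS order_count,
    sum(total_amount)           AS total_revenue
FROM lakehouse.structure.orders
GROUP BY date_trunc('day', order_ts);
```

Publishing curated views like this is one lightweight way to expose data products while keeping validation and governance concentrated in the underlying Structure-layer tables.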