Company: Dagster
Date Published:
Author: Tim Castillo
Word count: 3432
Language: English
Hacker News points: None

Summary

A data pipeline is a series of orchestrated tasks that collect, process, and transform raw data into a usable format. In Dagster, a pipeline is framed as the production of data assets: persistent objects such as files or database tables. A pipeline is designed by identifying the data assets that need to be produced, breaking each one down into more atomic assets, and repeating the process until the source data is reached. The pipeline is implemented in Python, with each asset declaring the upstream assets it depends on. Dagster also provides auto-materialization policies, which define when an asset should be (re)materialized, and asset checks, which verify data quality for each asset in the pipeline. The final state of the code has every asset implemented, scheduled, and documented, with data quality tests attached to some of them.
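
As a rough sketch of how such an asset graph looks in code (the asset names below are hypothetical, not taken from the article), each Dagster asset is a Python function decorated with `@asset`, and it declares a dependency simply by accepting an upstream asset as a function parameter:

```python
from dagster import asset


@asset
def raw_orders():
    """Source data: pull raw order records from an external system."""
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": -5}]


@asset
def cleaned_orders(raw_orders):
    """Depends on raw_orders: drop records that fail basic sanity rules."""
    return [order for order in raw_orders if order["amount"] >= 0]


@asset
def order_summary(cleaned_orders):
    """Depends on cleaned_orders: aggregate the cleaned data into a report."""
    return {
        "count": len(cleaned_orders),
        "total": sum(order["amount"] for order in cleaned_orders),
    }
```

From the parameter names alone, Dagster infers the dependency graph raw_orders → cleaned_orders → order_summary, which mirrors the design process the article describes: working backward from the final asset to the source data.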
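
Auto-materialization policies are attached per asset. A minimal sketch, again with hypothetical assets, using the eager policy that re-materializes an asset whenever its upstream data changes (newer Dagster releases are moving this feature toward automation conditions, but the policy API shown here matches the article's vocabulary):

```python
from dagster import AutoMaterializePolicy, asset


@asset
def cleaned_orders():
    """Upstream asset, stubbed out for this sketch."""
    ...


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def order_summary(cleaned_orders):
    """Rebuilt automatically whenever cleaned_orders is updated."""
    ...
```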
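
Asset checks are defined alongside the asset they verify. The check below is a sketch assuming a hypothetical cleaned_orders asset that yields a list of order records:

```python
from dagster import AssetCheckResult, asset, asset_check


@asset
def cleaned_orders():
    """Stubbed asset whose output the check will inspect."""
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 40}]


@asset_check(asset=cleaned_orders)
def no_negative_amounts(cleaned_orders):
    """Data quality check: every cleaned order has a non-negative amount."""
    bad_records = [o for o in cleaned_orders if o["amount"] < 0]
    return AssetCheckResult(
        passed=not bad_records,
        metadata={"bad_records": len(bad_records)},
    )
```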
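
Finally, the summary notes that the finished pipeline is scheduled. One common way to wire that up in Dagster (a sketch with hypothetical names, not the article's exact code) is an asset job driven by a cron schedule, registered in a Definitions object:

```python
from dagster import (
    AssetSelection,
    Definitions,
    ScheduleDefinition,
    asset,
    define_asset_job,
)


@asset
def raw_orders():
    ...


@asset
def order_summary(raw_orders):
    ...


# Rebuild every asset once a day at 06:00.
daily_refresh = ScheduleDefinition(
    job=define_asset_job("daily_refresh_job", selection=AssetSelection.all()),
    cron_schedule="0 6 * * *",
)

defs = Definitions(assets=[raw_orders, order_summary], schedules=[daily_refresh])
```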