Company:
Date Published:
Author: Sandy Ryza
Word count: 1,569
Language: English
Hacker News points: None

Summary

Apache Spark is a powerful tool for data processing, but developing Spark applications is challenging because each stage of the development cycle demands a drastically different setup: a local setup with small datasets to catch basic errors quickly, representative sample datasets to surface data edge cases, and production-sized datasets on a cluster to expose performance issues. Dagster, a data orchestrator, addresses this complexity by organizing Spark code alongside its deployment setups and providing pre-built utilities for deploying Spark code to environments like EMR and Databricks. It cleanly separates business logic from environment configuration: pipelines are DAGs of Python functions called solids, and each pipeline can run in different modes, such as local development, EMR, or production. Dagster's PySpark integration lets developers easily switch between these setups and deploys code on each job run, automating packaging and upload to S3 for a tighter development loop.
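
To make the separation of business logic from deployment setup concrete, here is a minimal sketch using Dagster's pre-1.0 API, which is the era the article describes (pipelines built from solids, with each deployment setup expressed as a ModeDefinition). The `pyspark_resource` comes from the `dagster_pyspark` integration package; Spark settings and the EMR/Databricks step-launcher wiring are elided, so treat this as an illustration under those assumptions rather than a drop-in example.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, solid
from dagster_pyspark import pyspark_resource


# Business logic lives in a solid: an ordinary PySpark transformation
# with no environment-specific setup baked in.
@solid(required_resource_keys={"pyspark"})
def count_people(context):
    spark = context.resources.pyspark.spark_session
    people = spark.createDataFrame(
        [("alice", 30), ("bob", 42)], ["name", "age"]
    )
    context.log.info(f"people over 35: {people.filter(people.age > 35).count()}")


# Deployment setup lives in modes. A production mode would swap in an
# EMR- or Databricks-backed resource (e.g. the step launcher from
# dagster_aws.emr) without touching the solid above. Spark settings
# (master, executor memory, ...) are supplied through resource config,
# omitted here for brevity.
local_mode = ModeDefinition(
    name="local",
    resource_defs={"pyspark": pyspark_resource},
)


@pipeline(mode_defs=[local_mode])
def people_pipeline():
    count_people()


if __name__ == "__main__":
    execute_pipeline(people_pipeline, mode="local")
```

Because the solid only declares that it needs a `pyspark` resource, switching from the local mode to a cluster-backed one changes where and how Spark runs, not what the pipeline computes.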