Company:
Date Published:
Author: Sandy Ryza
Word count: 1,569
Language: English
Hacker News points: None

Summary

Apache Spark is a powerful tool for data processing, but developing Spark applications is challenging because each stage of the development cycle demands a drastically different setup: a local setup with small datasets to catch basic errors quickly, representative sample datasets to surface data edge cases, and production-sized datasets on a cluster to expose performance issues. Dagster, a data orchestrator, addresses this complexity by organizing Spark code alongside its deployment setups and providing pre-built utilities for deploying Spark code to environments like EMR and Databricks. It cleanly separates business logic from environment configuration: pipelines are DAGs of Python functions called solids, and each pipeline can run in different modes, such as local development, EMR, or production. Dagster's PySpark integration lets developers easily switch between these setups and deploys code on each job run, automating packaging and upload to S3 for a tighter development loop.
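
To make the separation of business logic from deployment setup concrete, here is a minimal sketch using Dagster's pre-1.0 API, which is the era the article describes (pipelines built from solids, with each deployment setup expressed as a ModeDefinition). The `pyspark_resource` comes from the `dagster_pyspark` integration package; Spark settings and the EMR/Databricks step-launcher wiring are elided, so treat this as an illustration under those assumptions rather than a drop-in example.

```python
from dagster import ModeDefinition, execute_pipeline, pipeline, solid
from dagster_pyspark import pyspark_resource


# Business logic lives in a solid: an ordinary PySpark transformation
# with no environment-specific setup baked in.
@solid(required_resource_keys={"pyspark"})
def count_people(context):
    spark = context.resources.pyspark.spark_session
    people = spark.createDataFrame(
        [("alice", 30), ("bob", 42)], ["name", "age"]
    )
    context.log.info(f"people over 35: {people.filter(people.age > 35).count()}")


# Deployment setup lives in modes. A production mode would swap in an
# EMR- or Databricks-backed resource (e.g. the step launcher from
# dagster_aws.emr) without touching the solid above. Spark settings
# (master, executor memory, ...) are supplied through resource config,
# omitted here for brevity.
local_mode = ModeDefinition(
    name="local",
    resource_defs={"pyspark": pyspark_resource},
)


@pipeline(mode_defs=[local_mode])
def people_pipeline():
    count_people()


if __name__ == "__main__":
    execute_pipeline(people_pipeline, mode="local")
```

Because the solid only declares that it needs a `pyspark` resource, switching from the local mode to a cluster-backed one changes where and how Spark runs, not what the pipeline computes.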