Company:
Date Published:
Author: Kyle Hoondert
Word count: 1421
Language: English
Hacker News points: None

Summary

This guide focuses on batch data ingestion with Apache Airflow and Apache Druid, emphasizing the role of batch processes in loading historical data when real-time streams fall short. It provides a walkthrough of setting up Airflow with Docker, including the configuration changes needed to install the Druid Provider, which is required for submitting ingestion tasks to Druid. The guide covers configuring the environment, adding the necessary connections, and parameterizing ingestion specifications with Jinja templating to build repeatable, reliable data pipelines. Finally, it shows how to create and run a Directed Acyclic Graph (DAG) in Airflow using a sample Druid native ingestion spec, and describes troubleshooting steps for batch ingestion tasks, such as checking logs to verify that the ingestion specs are valid and that parameters are interpreted correctly.
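For concreteness, below is a minimal sketch of the kind of DAG the guide describes, using the DruidOperator from the apache-airflow-providers-apache-druid package. It assumes the provider has been installed into the Airflow environment (for example via pip in the Docker image) and Airflow 2.4 or newer; the dag_id, spec file name, and connection id are illustrative placeholders, not values taken from the original article.

# Minimal sketch: submit a Jinja-templated native ingestion spec to Druid.
# Assumes the apache-airflow-providers-apache-druid package is installed
# and an Airflow connection to the Druid Overlord has been configured.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.druid.operators.druid import DruidOperator

with DAG(
    dag_id="druid_batch_ingest",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,                        # trigger manually while testing
    catchup=False,
) as dag:
    # json_index_file is a templated field, so Jinja expressions inside
    # the spec file (e.g. {{ ds }}) are rendered before submission.
    submit_spec = DruidOperator(
        task_id="submit_ingestion_spec",
        json_index_file="druid_ingestion_spec.json",  # placeholder spec file
        druid_ingest_conn_id="druid_ingest_default",  # placeholder connection id
    )

Because json_index_file is one of the operator's templated fields, Jinja expressions placed inside the JSON spec (such as {{ ds }} in the ingestion intervals) are rendered at run time, which is what enables the parameterized, repeatable pipelines the summary refers to.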