Apache Airflow for Orchestration and Monitoring of Apache Druid
Blog post from Rill
This post outlines an approach to observability and data health checks in pipelines that ingest into Apache Druid, orchestrated with Apache Airflow and alerting through systems like Opsgenie and Slack. It stresses maintaining data quality and completeness from the initial stages of raw-data processing through to analysis, combining static rule checks (fixed thresholds that never change between runs) with dynamic, data-driven tests that derive expected behavior from the data itself.

The article weighs the trade-offs between cost, timeliness, and validation coverage, advocating an iterative approach to testing: start with a small set of checks and expand as production pipelines reveal new failure modes. When ingestion fails, the priority is to identify the root cause quickly and to automate the response where possible, keeping data lag to a minimum.

Monitoring end-user performance is also crucial, particularly query latency on massive datasets; the authors use Rill Explore dashboards to diagnose such issues. The post concludes by recommending that business stakeholders be included in the alerting process and that post-mortems be conducted to refine workflows and reduce future incidents, sharing insights gained from the team's experience running an always-on observability system.
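The post does not include code, but the two kinds of checks it describes can be sketched in plain Python. Below, a minimal illustration under my own assumptions: `check_static_rules` applies fixed thresholds (a hypothetical non-empty-ingest rule and a null-ratio cap), while `check_dynamic_volume` is a data-driven test that flags today's row count if it falls outside a standard-deviation band around recent history. Function names, thresholds, and the 3-sigma band are illustrative, not from the article; in practice these would run as tasks in an Airflow DAG after each Druid ingestion.

```python
from statistics import mean, stdev

def check_static_rules(row_count, null_ratio, max_null_ratio=0.01):
    """Static rule checks: fixed thresholds that never change between runs.

    Returns a list of failure messages (empty list means the check passed).
    """
    failures = []
    if row_count == 0:
        failures.append("ingested zero rows")
    if null_ratio > max_null_ratio:
        failures.append(
            f"null ratio {null_ratio:.2%} exceeds limit {max_null_ratio:.2%}"
        )
    return failures

def check_dynamic_volume(todays_count, historical_counts, n_sigmas=3.0):
    """Dynamic, data-driven check: compare today's row count against a
    band of n_sigmas sample standard deviations around the trailing mean.

    Returns a list of failure messages (empty list means the check passed).
    """
    mu = mean(historical_counts)
    sigma = stdev(historical_counts)
    if abs(todays_count - mu) > n_sigmas * sigma:
        return [
            f"row count {todays_count} outside {n_sigmas}-sigma band "
            f"around trailing mean {mu:.0f}"
        ]
    return []

# Example: a stable history of ~10k rows/day; a 5k-row day is flagged.
history = [10_000, 10_200, 9_900, 10_100, 10_050]
print(check_static_rules(row_count=0, null_ratio=0.0))      # zero-row failure
print(check_dynamic_volume(10_000, history))                # within band: []
print(check_dynamic_volume(5_000, history))                 # flagged as anomalous
```

Each function returns failure messages rather than raising, so an orchestrator task can aggregate the results and decide whether to page via Opsgenie, post to Slack, or trigger an automated re-ingest, in line with the article's advice to automate responses where possible.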