
Building highly reliable data pipelines at Datadog

Blog post from Datadog

Post Details
Company
Datadog
Date Published
Author
Quentin Francois
Word Count
2,061
Language
English
Hacker News Points
16
Summary

Quentin Francois, a data engineer at Datadog, shares the company's best practices for building reliable data pipelines. According to Francois, reliability is not about never failing, but about ensuring that pipelines deliver good data on time. Achieving this means designing pipelines with fault tolerance and monitoring in mind. Datadog's pipelines run on object stores, cloud Hadoop/Spark services, and spot instances; spot instances in particular can be reclaimed at any time as supply and demand fluctuate. With the right architecture, cluster configuration, and monitoring, however, such failures can be anticipated and recovered from quickly. Key strategies include breaking jobs down into smaller pieces; monitoring cluster metrics, job metrics, and data latencies; and thinking ahead about potential failure modes. By following these best practices, data engineers can build pipelines that reliably meet the needs of downstream users.
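The "break jobs into smaller pieces and recover quickly" idea can be sketched in a few lines. This is a minimal, hypothetical illustration, not Datadog's actual implementation: it assumes each partition of work is independent and idempotent, uses a simple in-memory dict as a checkpoint store (a real pipeline would checkpoint to an object store), and retries with exponential backoff so a reclaimed spot instance only costs re-running one small piece rather than the whole job. The names `process_partition` and `run_pipeline` are invented for the example.

```python
import time


def process_partition(partition):
    """Hypothetical per-partition work; stands in for a real Hadoop/Spark task."""
    return f"output for {partition}"


def run_pipeline(partitions, completed, max_retries=3):
    """Run each small piece independently so a failure only re-runs that piece.

    `completed` acts as a checkpoint store: partitions already finished in a
    previous run are skipped, making recovery after a failure cheap.
    """
    results = {}
    for partition in partitions:
        if partition in completed:  # already done in an earlier run: skip
            results[partition] = completed[partition]
            continue
        for attempt in range(1, max_retries + 1):
            try:
                results[partition] = process_partition(partition)
                completed[partition] = results[partition]  # checkpoint progress
                break
            except Exception:
                if attempt == max_retries:
                    raise  # surface the failure after exhausting retries
                time.sleep(2 ** attempt)  # back off before retrying
    return results


checkpoints = {}
out = run_pipeline(["2024-01-01", "2024-01-02"], checkpoints)
```

On a re-run after a crash, any partition already present in the checkpoint store is skipped, which is what keeps recovery time short when cheap-but-unreliable spot capacity disappears mid-job.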