Why create an open source extension of Data Diff?

Post Details

Company

Datafold

Date Published

Aug. 9, 2022

Author

Matthew David

Word Count

879

Language

English

Hacker News Points

-

Source URL

www.datafold.com/blog/data-engineering-podcast-open-source-data-diff

Summary

In a Data Engineering Podcast episode, Gleb Mezhanskiy and Simon Eskildsen discuss Datafold's open-source data-diff tool with host Tobias Macey, exploring its origins, design decisions, and practical applications. The tool was developed to automate the tedious task of regression testing in data engineering and was released as open source to allow for community-driven enhancements, particularly for replication validation. Python was chosen for its widespread use in the data community; concerns about its speed are offset by the fact that the heavy lifting is done in the data stores. The choice of MD5 as the checksum algorithm is justified by its ubiquity, despite the potential for hash collisions, and the aim is to eventually adopt more sophisticated techniques for detecting changes. They also address challenges such as comparing data types across different stores and note the limitations of the open-source data-diff, particularly when dealing with large datasets or data stores that lack aggregation engines. The episode concludes with a reflection on the broader gaps in data management technology and the aspiration to build tools that make data engineers' workflows more efficient, ultimately enabling more effective use of data.
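To make the checksum idea concrete, here is a minimal Python sketch of comparing two tables by MD5 checksums over key ranges. It is an illustration under stated assumptions, not data-diff's actual implementation: the tables are plain in-memory dicts, the function names `segment_checksum` and `diff_keys` are hypothetical, and the strategy of splitting a mismatched range in half is an assumption that the summary does not describe. In a real setup the hashing would presumably run inside each data store rather than in Python.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Return an MD5 hex digest over the rows whose key is in [lo, hi).

    `rows` is a dict of integer key -> row value (a stand-in for a table).
    """
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        # Newline-terminate each row so adjacent rows cannot blur together.
        digest.update(f"{key}|{rows[key]}\n".encode("utf-8"))
    return digest.hexdigest()


def diff_keys(a, b, lo, hi, min_size=2):
    """Return the sorted keys in [lo, hi) whose rows differ between a and b.

    Hypothetical strategy: if the checksums of a range match, skip it;
    otherwise split the range in half until it is small enough to compare
    row by row.
    """
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []
    if hi - lo <= min_size:
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    mid = (lo + hi) // 2
    return diff_keys(a, b, lo, mid, min_size) + diff_keys(a, b, mid, hi, min_size)


# Made-up example data: a source table and a replica with two discrepancies.
source = {i: f"row-{i}" for i in range(1, 101)}
replica = dict(source)
replica[42] = "row-42-changed"  # modified row
del replica[77]                 # missing row

changed = diff_keys(source, replica, 1, 101)
print(changed)  # [42, 77]
```

Only ranges whose checksums disagree are examined further, which is why most of the work can stay in the data stores. As the summary notes, a hash collision could in principle hide a real difference.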