
Fast data cataloging of streaming data for fun and privacy

What's this blog post about?

Gretel's REST APIs automatically build a metastore that makes it easy to understand what is inside your data; developers who have worked with a metastore before will find the concept familiar. The metastore helps identify primary identifiers and other potentially identifying fields in the data. By using word embeddings to compare field names, Gretel can infer relationships between schema fields without analyzing the field contents. In a case study of GBFS free_bike_status.json feeds, Gretel found that some data producers regenerated ephemeral bike_id values at a steady interval, while the jump_vehicle_name field in the same records mapped one-to-one to an actual vehicle. The privacy implications are real: someone who observes bikes through the app and combines that with aggregated feed data could potentially reconstruct the rides of specific individuals. After Gretel disclosed the issue on GitHub, the data was removed from the feeds within hours.
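The case-study finding can be illustrated with a minimal sketch. It is not Gretel's implementation and does not use Gretel's APIs; it only shows, on made-up data, how a stable field can undo the protection that rotating bike_id values are meant to provide. The snapshot layout follows the general shape of a GBFS free_bike_status.json document, and the helper names (`ids_per_vehicle`, `rotating_vehicles`) and all sample values are hypothetical.

```python
from collections import defaultdict

def ids_per_vehicle(snapshots, stable_field="jump_vehicle_name", id_field="bike_id"):
    """Map each value of stable_field to the set of id_field values seen with it."""
    seen = defaultdict(set)
    for snapshot in snapshots:
        for bike in snapshot["data"]["bikes"]:
            name = bike.get(stable_field)
            if name is not None:
                seen[name].add(bike[id_field])
    return dict(seen)

def rotating_vehicles(snapshots, **kwargs):
    """Return the stable names that were observed with more than one id."""
    mapping = ids_per_vehicle(snapshots, **kwargs)
    return sorted(name for name, ids in mapping.items() if len(ids) > 1)

# Made-up snapshots taken at two points in time (illustrative values only).
# The bike_id values change between snapshots; the vehicle names do not.
SNAPSHOTS = [
    {"last_updated": 1, "data": {"bikes": [
        {"bike_id": "a1", "jump_vehicle_name": "ABC123", "lat": 0.0, "lon": 0.0},
        {"bike_id": "b7", "jump_vehicle_name": "XYZ789", "lat": 0.1, "lon": 0.1},
    ]}},
    {"last_updated": 2, "data": {"bikes": [
        {"bike_id": "c4", "jump_vehicle_name": "ABC123", "lat": 0.2, "lon": 0.2},
        {"bike_id": "d9", "jump_vehicle_name": "XYZ789", "lat": 0.1, "lon": 0.1},
    ]}},
]

if __name__ == "__main__":
    for name in rotating_vehicles(SNAPSHOTS):
        print(name, sorted(ids_per_vehicle(SNAPSHOTS)[name]))
```

In this toy data every vehicle name appears with two different bike_id values, so anyone collecting the feed could follow a vehicle across id rotations by keying on the name instead of the id.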


Date published
Sept. 1, 2020

Author
John Myers



By Matt Makai. 2021-2024.