
Dealing with duplicate event IDs

Blog post from Snowplow

Post Details
Author
Snowplow Team
Word Count
1,582
Language
English
Summary

The Snowplow pipeline tracks behavioral events, each tagged with an event ID that is meant to be unique; in practice, duplicate IDs occur regularly. Duplicates arise either within the pipeline itself (endogenous duplicates) or from client-side processes that send events with the same ID (exogenous duplicates). Endogenous duplicates are true duplicates, with identical client-sent fields, whereas exogenous duplicates are typically caused by external actors such as browser pre-cachers and web scrapers. To manage them, Snowplow applies a deduplication algorithm that either deletes redundant events or assigns new IDs while preserving a link back to the original event. In Redshift, deduplication is performed with SQL queries that remove duplicates and move them into separate tables, ensuring that event IDs in the main dataset remain unique. Snowplow plans to extend this deduplication approach to Amazon Kinesis, although that effort is complicated by the possibility of KCL applications introducing endogenous duplicates of their own.
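The two-step strategy described above (collapse true duplicates, re-key same-ID events with different payloads) can be sketched as follows. This is a minimal illustration, not Snowplow's actual implementation; the event shape, field names (`event_id`, `original_event_id`), and use of `uuid4` for fresh IDs are all assumptions for the example.

```python
import uuid
from collections import defaultdict

def deduplicate(events):
    """Hypothetical sketch of the deduplication described above.

    `events` is a list of dicts, each with an 'event_id' key plus payload
    fields. Identical payloads sharing an ID (endogenous duplicates) are
    collapsed to one copy; distinct payloads sharing an ID (exogenous
    duplicates) get a fresh ID but keep a pointer to the original ID so
    the relationship to the first event is preserved.
    """
    by_id = defaultdict(list)
    for event in events:
        by_id[event["event_id"]].append(event)

    deduped = []
    for event_id, group in by_id.items():
        # Collapse field-for-field identical duplicates to a single copy.
        distinct = []
        for event in group:
            if event not in distinct:
                distinct.append(event)
        # The first distinct event keeps its ID; the rest are re-keyed,
        # recording the ID they originally arrived with.
        deduped.append(distinct[0])
        for event in distinct[1:]:
            deduped.append({**event,
                            "event_id": str(uuid.uuid4()),
                            "original_event_id": event_id})
    return deduped
```

In a real pipeline the same idea would be expressed in SQL against the Redshift tables, but the control flow is the same: endogenous duplicates are deleted, exogenous ones are re-identified rather than dropped, so no data is lost.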