
Dealing with duplicate event IDs

Blog post from Snowplow

Post Details
Author
Snowplow Team
Word Count
1,582
Language
English
Summary

The Snowplow pipeline tracks behavioral events, each tagged with an event ID that is meant to be unique; in practice, duplicate IDs occur regularly. Duplicates arise either within the pipeline itself (endogenous duplicates) or from client-side processes that send events with the same ID (exogenous duplicates). Endogenous duplicates are true duplicates, with identical client-sent fields, whereas exogenous duplicates are typically caused by external actors such as browser pre-cachers and web scrapers. To manage them, Snowplow applies a deduplication algorithm that either deletes redundant events or assigns new IDs while preserving a link back to the original event. In Redshift, deduplication is performed with SQL queries that remove duplicates and move them into separate tables, ensuring that event IDs in the main dataset remain unique. Snowplow plans to extend this deduplication approach to Amazon Kinesis, although that effort is complicated by the possibility of KCL applications introducing endogenous duplicates of their own.
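The two-step strategy described above (collapse true duplicates, re-key same-ID events with different payloads) can be sketched as follows. This is a minimal illustration, not Snowplow's actual implementation; the event shape, field names (`event_id`, `original_event_id`), and use of `uuid4` for fresh IDs are all assumptions for the example.

```python
import uuid
from collections import defaultdict

def deduplicate(events):
    """Hypothetical sketch of the deduplication described above.

    `events` is a list of dicts, each with an 'event_id' key plus payload
    fields. Identical payloads sharing an ID (endogenous duplicates) are
    collapsed to one copy; distinct payloads sharing an ID (exogenous
    duplicates) get a fresh ID but keep a pointer to the original ID so
    the relationship to the first event is preserved.
    """
    by_id = defaultdict(list)
    for event in events:
        by_id[event["event_id"]].append(event)

    deduped = []
    for event_id, group in by_id.items():
        # Collapse field-for-field identical duplicates to a single copy.
        distinct = []
        for event in group:
            if event not in distinct:
                distinct.append(event)
        # The first distinct event keeps its ID; the rest are re-keyed,
        # recording the ID they originally arrived with.
        deduped.append(distinct[0])
        for event in distinct[1:]:
            deduped.append({**event,
                            "event_id": str(uuid.uuid4()),
                            "original_event_id": event_id})
    return deduped
```

In a real pipeline the same idea would be expressed in SQL against the Redshift tables, but the control flow is the same: endogenous duplicates are deleted, exogenous ones are re-identified rather than dropped, so no data is lost.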