
De-duplicating Events in Redshift and Hadoop: A Technical Tutorial for Snowplow Users

Blog post from Snowplow

Post Details
Company: Snowplow
Date Published:
Author: Snowplow Team
Word Count: 651
Language: English
Hacker News Points: -
Summary

Snowplow's pipeline is engineered to preserve data integrity, yet duplicate events can still occur, and they are particularly troublesome in Redshift, where repeated event_ids turn table joins into row-multiplying cartesian products. This tutorial covers best practices for detecting, managing, and eliminating duplicates using Snowplow's batch pipeline, Redshift SQL, and Hadoop de-duplication jobs, offering practical guidance for data engineers and platform maintainers.

Duplicates arise primarily from tracker retry logic, where an identical payload is re-sent because no confirmation of receipt arrives, and from JavaScript tracker edge cases in which bots and spiders generate non-unique event_ids. Snowplow mitigates this with a Hadoop de-duplication job that removes most duplicates before data reaches Redshift, although it does not handle synthetic duplicates that span batches.

For duplicates already in Redshift, the tutorial outlines three strategies: joining on both event_id and collector_tstamp, de-duplicating in the modeling layer using dbt, and using SQL scripts to move duplicate rows into a separate schema. Recommended best practices include updating joins to use both event_id and collector_tstamp, staging de-duplication scripts, customizing fingerprint logic, and investigating anomalies that may point to bot activity or tracker misconfiguration.
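
To illustrate the join fix, a minimal sketch is shown below, assuming the standard Snowplow Redshift layout in which shredded context tables reference the parent event via root_id and root_tstamp columns; the context table atomic.com_example_context_1 and its example_field column are hypothetical placeholders, not names taken from the post.

    -- Join a shredded context table on both keys so a duplicated event_id
    -- no longer multiplies rows (context table and column are hypothetical).
    SELECT
        ev.event_id,
        ev.collector_tstamp,
        ctx.example_field
    FROM atomic.events AS ev
    JOIN atomic.com_example_context_1 AS ctx
        ON  ctx.root_id = ev.event_id
        AND ctx.root_tstamp = ev.collector_tstamp;

Below is a rough sketch of the quarantine-style approach, assuming the duplicates in question are natural ones (byte-for-byte identical rows) that SELECT DISTINCT can collapse; the duplicates schema, the table names, and the overall flow are illustrative and are not the scripts referenced by the post.

    -- Quarantine every row whose event_id is duplicated, for later
    -- investigation (schema and table names are hypothetical).
    CREATE SCHEMA IF NOT EXISTS duplicates;
    CREATE TABLE IF NOT EXISTS duplicates.events (LIKE atomic.events);

    INSERT INTO duplicates.events
    SELECT *
    FROM atomic.events
    WHERE event_id IN (
        SELECT event_id
        FROM atomic.events
        GROUP BY event_id
        HAVING COUNT(*) > 1
    );

    -- Build a copy of atomic.events with identical rows collapsed,
    -- then swap it into place.
    CREATE TABLE atomic.events_dedup (LIKE atomic.events);

    INSERT INTO atomic.events_dedup
    SELECT DISTINCT *
    FROM atomic.events;

    ALTER TABLE atomic.events RENAME TO events_with_dupes;
    ALTER TABLE atomic.events_dedup RENAME TO events;

Synthetic duplicates, which share an event_id but differ in payload, cannot be collapsed this way and call for the fingerprint-based handling the tutorial describes.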