
De-duplicating Events in Redshift and Hadoop: A Technical Tutorial for Snowplow Users

Blog post from Snowplow

Post Details
Company: Snowplow
Date Published:
Author: Snowplow Team
Word Count: 651
Language: English
Hacker News Points: -
Summary

Snowplow's pipeline is engineered to preserve data integrity, yet duplicate events can still occur, and they are particularly troublesome in Redshift, where repeated event_ids turn table joins into row-multiplying cartesian products. This tutorial covers best practices for detecting, managing, and eliminating duplicates using Snowplow's batch pipeline, Redshift SQL, and Hadoop de-duplication jobs, offering practical guidance for data engineers and platform maintainers.

Duplicates arise primarily from tracker retry logic, where an identical payload is re-sent because no confirmation of receipt arrives, and from JavaScript tracker edge cases in which bots and spiders generate non-unique event_ids. Snowplow mitigates this with a Hadoop de-duplication job that removes most duplicates before data reaches Redshift, although it does not handle synthetic duplicates that span batches.

For duplicates already in Redshift, the tutorial outlines three strategies: joining on both event_id and collector_tstamp, de-duplicating in the modeling layer using dbt, and using SQL scripts to move duplicate rows into a separate schema. Recommended best practices include updating joins to use both event_id and collector_tstamp, staging de-duplication scripts, customizing fingerprint logic, and investigating anomalies that may point to bot activity or tracker misconfiguration.
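
To illustrate the join fix, a minimal sketch is shown below, assuming the standard Snowplow Redshift layout in which shredded context tables reference the parent event via root_id and root_tstamp columns; the context table atomic.com_example_context_1 and its example_field column are hypothetical placeholders, not names taken from the post.

    -- Join a shredded context table on both keys so a duplicated event_id
    -- no longer multiplies rows (context table and column are hypothetical).
    SELECT
        ev.event_id,
        ev.collector_tstamp,
        ctx.example_field
    FROM atomic.events AS ev
    JOIN atomic.com_example_context_1 AS ctx
        ON  ctx.root_id = ev.event_id
        AND ctx.root_tstamp = ev.collector_tstamp;

Below is a rough sketch of the quarantine-style approach, assuming the duplicates in question are natural ones (byte-for-byte identical rows) that SELECT DISTINCT can collapse; the duplicates schema, the table names, and the overall flow are illustrative and are not the scripts referenced by the post.

    -- Quarantine every row whose event_id is duplicated, for later
    -- investigation (schema and table names are hypothetical).
    CREATE SCHEMA IF NOT EXISTS duplicates;
    CREATE TABLE IF NOT EXISTS duplicates.events (LIKE atomic.events);

    INSERT INTO duplicates.events
    SELECT *
    FROM atomic.events
    WHERE event_id IN (
        SELECT event_id
        FROM atomic.events
        GROUP BY event_id
        HAVING COUNT(*) > 1
    );

    -- Build a copy of atomic.events with identical rows collapsed,
    -- then swap it into place.
    CREATE TABLE atomic.events_dedup (LIKE atomic.events);

    INSERT INTO atomic.events_dedup
    SELECT DISTINCT *
    FROM atomic.events;

    ALTER TABLE atomic.events RENAME TO events_with_dupes;
    ALTER TABLE atomic.events_dedup RENAME TO events;

Synthetic duplicates, which share an event_id but differ in payload, cannot be collapsed this way and call for the fingerprint-based handling the tutorial describes.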