
Debugging bad data in GCP with BigQuery

Blog post from Snowplow

Post Details
Company: Snowplow
Date Published:
Author: Colm O Griobhtha
Word Count: 1,584
Language: English
Hacker News Points: -
Summary

The Snowplow pipeline is designed to prioritize data quality from the outset: incoming data is validated against predefined schemas, and events that fail validation are preserved as bad rows for further analysis rather than discarded. On Google Cloud Platform (GCP), these bad rows are streamed to Cloud Storage, where users can create external tables in BigQuery to monitor them in near real time, or load them into native tables for more detailed, cost-effective analysis. The guide explains how to identify the causes of validation failures, which typically stem from non-Snowplow traffic hitting the collector or from mismatches between tracked events and their schemas. It then walks through querying the bad row data, limiting exploratory queries, counting errors by type, and decoding the base64-encoded payloads to pinpoint the source of each validation error, so that tracking code and schemas can be refined to improve data quality.
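
To make the workflow concrete, here is a minimal BigQuery SQL sketch of the steps the summary describes. The dataset name (snowplow), bucket path, and bad-row field names (line, errors) are assumptions for illustration only; the actual schema and paths depend on your pipeline configuration and the bad-row format in use, so consult the guide for the exact definitions.

-- 1. External table reading bad rows directly from Cloud Storage (near-real-time monitoring).
--    Field names and bucket path are placeholders, not the guide's exact definitions.
CREATE EXTERNAL TABLE IF NOT EXISTS snowplow.bad_rows_ext (
  line STRING,
  errors ARRAY<STRUCT<message STRING, level STRING>>
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://YOUR-BAD-ROWS-BUCKET/*']
);

-- 2. Count validation errors by message; a LIMIT keeps exploratory result sets small.
--    Error counts help separate non-Snowplow traffic from genuine schema mismatches.
SELECT
  err.message AS error_message,
  COUNT(*) AS occurrences
FROM snowplow.bad_rows_ext,
  UNNEST(errors) AS err
GROUP BY error_message
ORDER BY occurrences DESC
LIMIT 100;

-- 3. Decode a base64-encoded payload to inspect what was actually sent to the collector.
SELECT
  SAFE_CONVERT_BYTES_TO_STRING(SAFE.FROM_BASE64(line)) AS decoded_payload
FROM snowplow.bad_rows_ext
LIMIT 10;

For repeated, heavier analysis, the same data can be copied into a native BigQuery table (for example with a CREATE TABLE ... AS SELECT over the external table), which is typically faster and cheaper to query than scanning Cloud Storage on every run.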