
Debugging bad data in GCP with BigQuery

Blog post from Snowplow

Post Details
Company: Snowplow
Date Published:
Author: Colm O Griobhtha
Word Count: 1,584
Language: English
Hacker News Points: -
Summary

The Snowplow pipeline is designed to prioritize data quality from the outset: incoming data is validated against predefined schemas, and events that fail validation are preserved as bad rows for further analysis rather than discarded. On Google Cloud Platform (GCP), these bad rows are streamed to Cloud Storage, where users can create external tables in BigQuery to monitor them in near real time, or load them into native tables for more detailed, cost-effective analysis. The guide explains how to identify the causes of validation failures, which typically stem from non-Snowplow traffic hitting the collector or from mismatches between tracked events and their schemas. It then walks through querying the bad row data, limiting exploratory queries, counting errors by type, and decoding the base64-encoded payloads to pinpoint the source of each validation error, so that tracking code and schemas can be refined to improve data quality.
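
To make the workflow concrete, here is a minimal BigQuery SQL sketch of the steps the summary describes. The dataset name (snowplow), bucket path, and bad-row field names (line, errors) are assumptions for illustration only; the actual schema and paths depend on your pipeline configuration and the bad-row format in use, so consult the guide for the exact definitions.

-- 1. External table reading bad rows directly from Cloud Storage (near-real-time monitoring).
--    Field names and bucket path are placeholders, not the guide's exact definitions.
CREATE EXTERNAL TABLE IF NOT EXISTS snowplow.bad_rows_ext (
  line STRING,
  errors ARRAY<STRUCT<message STRING, level STRING>>
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://YOUR-BAD-ROWS-BUCKET/*']
);

-- 2. Count validation errors by message; a LIMIT keeps exploratory result sets small.
--    Error counts help separate non-Snowplow traffic from genuine schema mismatches.
SELECT
  err.message AS error_message,
  COUNT(*) AS occurrences
FROM snowplow.bad_rows_ext,
  UNNEST(errors) AS err
GROUP BY error_message
ORDER BY occurrences DESC
LIMIT 100;

-- 3. Decode a base64-encoded payload to inspect what was actually sent to the collector.
SELECT
  SAFE_CONVERT_BYTES_TO_STRING(SAFE.FROM_BASE64(line)) AS decoded_payload
FROM snowplow.bad_rows_ext
LIMIT 10;

For repeated, heavier analysis, the same data can be copied into a native BigQuery table (for example with a CREATE TABLE ... AS SELECT over the external table), which is typically faster and cheaper to query than scanning Cloud Storage on every run.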