Kinesis Streams vs. S3 Buckets: What's the Best Choice for Your Snowplow Pipeline?
Blog post from Snowplow
Snowplow pipelines on AWS use both Kinesis Streams and S3 Buckets, and each plays a distinct role in data processing and storage. Kinesis Streams carry data in real time: they offer low latency and support multiple consumers, which suits them to tasks like collecting raw events and stream enrichment. S3 Buckets provide persistent storage for raw and enriched data, which enables batch processing and data lake integration, guards against data loss, and supports downstream processing. Oversized or malformed events in Kinesis are rerouted to a bad stream or captured in S3 for later analysis. Kinesis Firehose can optionally write directly to S3, although its buffering adds latency. The S3 Loader can run on the same instance as the collector when data volumes are low, or on dedicated instances or containers when higher throughput is needed. Used together, Kinesis Streams and S3 Buckets complement each other and keep Snowplow pipelines robust, scalable, and fault-tolerant.
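To make the bad-stream routing described above concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not Snowplow's implementation: the class name `BadRowRouter`, the JSON payload format, and the 1,000,000-byte size limit are all hypothetical choices made for the example, and the in-memory lists stand in for a good Kinesis stream and a bad stream (or an S3 prefix).

```python
import json
from dataclasses import dataclass, field

# Assumed per-record size limit for this sketch; check the actual limit
# configured for your stream and collector before relying on this value.
MAX_RECORD_BYTES = 1_000_000


@dataclass
class BadRowRouter:
    """Hypothetical router: valid events go to `good`, the rest to `bad`."""

    max_bytes: int = MAX_RECORD_BYTES
    good: list = field(default_factory=list)  # stands in for the good stream
    bad: list = field(default_factory=list)   # stands in for the bad stream / S3

    def route(self, payload: bytes) -> str:
        # Oversized events cannot be written to the stream, so divert them.
        if len(payload) > self.max_bytes:
            self.bad.append({"error": "oversized", "size": len(payload)})
            return "bad"
        # Assumption: events are JSON objects. Anything else counts as malformed.
        try:
            event = json.loads(payload)
        except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
            event = None
        if not isinstance(event, dict):
            self.bad.append({"error": "malformed", "size": len(payload)})
            return "bad"
        self.good.append(event)
        return "good"


if __name__ == "__main__":
    router = BadRowRouter(max_bytes=64)  # tiny limit so the demo is easy to follow
    for raw in (b'{"e": "pv"}', b"not json", b'{"pad": "' + b"x" * 100 + b'"}'):
        print(router.route(raw))
```

Keeping a small error record for each rejected event, rather than dropping it, is what allows the later analysis mentioned above. In a real pipeline the two lists would be replaced by writes to the respective streams or buckets.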