Content Deep Dive

Streaming optimized data to S3 for analytics with Parquet

Blog post from Redpanda

Post Details
Company: Redpanda
Date Published:
Author: Chandler Mayo
Word Count: 716
Language: English
Hacker News Points: -
Summary

Redpanda can export streaming data to Amazon S3. JSON is the starting point, but for more complex analytical workloads Apache Parquet is recommended: it is a binary, columnar format that compresses well and loads quickly. Redpanda Connect can encode streaming data directly into Parquet files, so the same stream can be delivered in the format each consumer needs, for example JSON for web applications and Parquet for data analytics. In this pipeline, Redpanda Connect reads messages from a topic, encodes them as Parquet, and writes the compressed files to S3. The encoding step requires a schema that defines the structure of the data. Compressing the files with the zstd algorithm reduces their size and therefore storage costs, and the resulting Parquet files can be queried with tools such as Pandas, Spark, or Athena. The post, part of a series, walks through setting up a Redpanda Connect pipeline that processes data continuously until it is manually stopped, and it stresses cleaning up resources afterward and adjusting the configuration for production environments. The next installment will cover integrating SQS and S3 notifications to build event-driven workflows that respond to new data in real time.
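
To make the "schema plus columnar layout" idea concrete, here is a minimal Python sketch using only the standard library. It is not Parquet and not Redpanda Connect: it only illustrates how row-shaped JSON messages can be checked against a schema and pivoted into per-column lists, which is the layout a columnar format stores. The field names (`user_id`, `event`, `amount`), the `SCHEMA` mapping, and the `to_columns` helper are all hypothetical and do not come from the original post.

```python
import json

# Hypothetical schema (an assumption for illustration): column name -> expected type.
SCHEMA = {"user_id": int, "event": str, "amount": float}


def to_columns(messages, schema=SCHEMA):
    """Pivot JSON-encoded messages (rows) into per-column value lists.

    Raises KeyError if a field is missing and TypeError if a value
    does not match the type declared in the schema.
    """
    columns = {name: [] for name in schema}
    for raw in messages:
        record = json.loads(raw)
        for name, expected in schema.items():
            value = record[name]
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name!r} expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
            columns[name].append(value)
    return columns


# Two sample messages, as they might arrive from a topic (made-up data).
messages = [
    '{"user_id": 1, "event": "click", "amount": 0.5}',
    '{"user_id": 2, "event": "view", "amount": 1.25}',
]

columns = to_columns(messages)
print(columns)
```

Grouping values of the same field together is what lets a columnar format compress well and lets an analytics query read only the columns it needs. A real pipeline would leave the encoding and compression to the Parquet encoder rather than to hand-written code like this.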