Stream Data to Amazon S3 Using YugabyteDB CDC and Apache Iceberg

Company

Yugabyte

Date Published

Oct. 13, 2022

Author

Rajat Venkatesh

Word count

954

Language

English

Hacker News points

None

URL

www.yugabyte.com/blog/stream-data-to-amazon-s3-using-yugabytedb-cdc

Summary

YugabyteDB can export data to Amazon S3 using Change Data Capture (CDC) and an open table format such as Apache Iceberg, achieving low-latency ingestion while avoiding costly rewrites. Log-based CDC is recommended for low-latency export because it captures all change types and processes changes without competing with other workloads. File formats such as CSV, JSON, Parquet, and ORC are popular choices for building file-based data lakes on object stores like Amazon S3. Transactional data lake technologies like Apache Iceberg add a metadata layer that provides ACID transactions and schema evolution. A CDC pipeline built on Debezium and Kafka supports low-latency export in two stages: the first exports CDC events from YugabyteDB to Kafka, and the second writes records to Apache Iceberg tables in Amazon S3. This separates concerns between the two stages, so each component can be paused or restarted independently. Two modes are available when writing data to Amazon S3: "Trace Inserts, Updates, and Deletes" and "Replay Inserts, Updates, and Deletes". The latter provides atomic transactions and primary key-level ordering of transactions, which is sufficient for most OLAP databases.
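
The difference between the two write modes can be illustrated with a small in-memory sketch. It assumes that the trace mode appends every change event as its own row, while the replay mode applies each event to the target table by primary key; the summary does not spell this out, so treat it as one plausible reading. The event shape (`op`, `key`, `row`) and the function names `trace` and `replay` are hypothetical, loosely modeled on Debezium's operation codes, and no Kafka or Iceberg API is involved.

```python
from typing import Any, Dict, List, Optional

# Hypothetical change-event shape (an assumption, not the connector's schema):
#   op  : "c" = insert, "u" = update, "d" = delete
#   key : primary key of the affected row
#   row : row image after the change (None for deletes)
Event = Dict[str, Any]


def trace(events: List[Event]) -> List[Event]:
    """Trace mode (assumed): keep one output row per change event."""
    return [dict(event) for event in events]


def replay(events: List[Event]) -> Dict[Any, Optional[Dict[str, Any]]]:
    """Replay mode (assumed): apply events in order, keyed by primary key."""
    table: Dict[Any, Optional[Dict[str, Any]]] = {}
    for event in events:
        op = event["op"]
        if op in ("c", "u"):
            # Inserts and updates both overwrite the row for this key.
            table[event["key"]] = event["row"]
        elif op == "d":
            # Deleting a key that is not present is treated as a no-op.
            table.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


events = [
    {"op": "c", "key": 1, "row": {"id": 1, "name": "alice"}},
    {"op": "c", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "u", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "d", "key": 2, "row": None},
]

history = trace(events)    # full change log, one row per event
current = replay(events)   # latest state per primary key
```

In this sketch `history` keeps all four events, which suits auditing, while `current` holds only the surviving row for key 1. Because events are applied in order per key, the result depends only on per-key ordering, which matches the primary key-level ordering guarantee mentioned in the summary.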