Company
Date Published
Author
Rajat Venkatesh
Word count
954
Language
English
Hacker News points
None

Summary

YugabyteDB databases can export data to Amazon S3 using Change Data Capture (CDC) and open table formats like Apache Iceberg, achieving low-latency data ingestion while avoiding costly rewrites. Log-based CDC is recommended for low-latency export because it captures all change types and processes changes without competing with other workloads. File formats such as CSV, JSON, Parquet, and ORC are popular choices for building file-based data lakes on object stores like Amazon S3. Transactional data lake technologies like Apache Iceberg add metadata layers that provide ACID transactions and schema evolution. A CDC pipeline built on Debezium and Kafka supports low-latency export in two stages: the first exports CDC events from YugabyteDB to Kafka, and the second writes records to Apache Iceberg tables in Amazon S3. This split separates concerns between the two stages, so each component can be paused or restarted independently. Two modes are available when writing data to Amazon S3: "Trace Inserts, Updates, and Deletes" and "Replay Inserts, Updates, and Deletes". The replay mode provides atomic transactions and primary key-level ordering of transactions, which is sufficient for most OLAP databases.
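
The difference between the two write modes can be illustrated with a small in-memory sketch. It rests on an assumed reading of the mode names: trace appends every change as its own row, while replay applies each change to the target so that it mirrors the current source state. The event shape, the `trace` and `replay` function names, and the sample rows are hypothetical. Only the `op` codes (`c`, `u`, `d`) follow Debezium's convention, and nothing here touches Kafka, Iceberg, or S3.

```python
from typing import Any, Optional

# Hypothetical, simplified change events. Real Debezium events carry more
# fields (source metadata, before-image, timestamps); this is an assumption.
Event = dict[str, Any]

events: list[Event] = [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "alice"}},
    {"op": "c", "key": 2, "after": {"id": 2, "name": "bob"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "alicia"}},
    {"op": "d", "key": 2, "after": None},
]


def trace(changes: list[Event]) -> list[Event]:
    """Assumed trace semantics: keep one output row per change event."""
    return [{"op": e["op"], "key": e["key"], "row": e["after"]} for e in changes]


def replay(changes: list[Event]) -> dict[int, Optional[dict[str, Any]]]:
    """Assumed replay semantics: apply changes in order, keyed by primary key."""
    table: dict[int, Optional[dict[str, Any]]] = {}
    for e in changes:
        if e["op"] == "d":
            # A delete removes the row from the target.
            table.pop(e["key"], None)
        else:
            # An insert or update overwrites the row for that key.
            table[e["key"]] = e["after"]
    return table


history = trace(events)   # full change log, one row per event
current = replay(events)  # latest state per primary key
```

Under these assumptions, `history` keeps all four changes, including the delete, while `current` holds only the surviving row for key 1. Replaying in per-key order is what makes the final state well defined, which is why primary key-level ordering matters for that mode.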