How to use dbt and Trino with Iceberg for a change data capture on a data lake
Blog post from Starburst
The article shows how to use dbt and Trino with Iceberg to implement change data capture (CDC) on a data lake, starting from AWS Database Migration Service (DMS) output stored as CSV files on S3. An external table is first created to read the raw files. A model named stg_dms__products then uses dbt's incremental materialization so that each run processes only new CDC records. The model relies on common table expressions (CTEs) to handle insert, update, and delete operations, and the article discusses both soft deletes, where a row is kept but flagged as deleted, and hard deletes, where the row is removed from the target table. The key techniques are having dbt generate a MERGE statement that applies the changes efficiently and choosing an incremental strategy that improves performance. The article also advises on configuring Iceberg table properties and on using dbt post-hooks for maintenance operations such as expiring snapshots. It provides practical examples and configurations for each step, and the complete dbt project is available in a GitHub repository.
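The flow summarized above (filter to new records, keep the latest change per key, then merge with soft or hard deletes) can be sketched in plain Python. This is only an illustration of the logic, not the article's dbt model or the SQL that dbt generates. The CdcRecord type, the op codes "I", "U", and "D", the product_id, name, and ts fields, and the is_deleted flag are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CdcRecord:
    """One CDC row (hypothetical shape, assumed for this sketch)."""
    op: str          # "I" insert, "U" update, "D" delete (assumed op codes)
    ts: int          # change timestamp; any increasing value works here
    product_id: int  # key used to match rows in the target
    name: Optional[str]


def new_records(records: List[CdcRecord], watermark: Optional[int]) -> List[CdcRecord]:
    """Incremental filter: keep only records newer than the last loaded timestamp."""
    return [r for r in records if watermark is None or r.ts > watermark]


def latest_per_key(records: List[CdcRecord]) -> Dict[int, CdcRecord]:
    """Keep the most recent change per key so each target row is touched once."""
    latest: Dict[int, CdcRecord] = {}
    for r in records:
        current = latest.get(r.product_id)
        if current is None or r.ts >= current.ts:
            latest[r.product_id] = r
    return latest


def merge(target: Dict[int, dict], changes: Dict[int, CdcRecord],
          soft_delete: bool = True) -> Dict[int, dict]:
    """Apply the changes to the target in place, mimicking MERGE semantics."""
    for key, r in changes.items():
        if r.op == "D" and not soft_delete:
            # Hard delete: remove the row from the target entirely.
            target.pop(key, None)
        else:
            # Insert or update; a soft delete keeps the row and flags it.
            target[key] = {
                "name": r.name,
                "is_deleted": r.op == "D",
                "updated_at": r.ts,
            }
    return target


def run_demo(soft_delete: bool) -> Dict[int, dict]:
    """Load one small batch into an empty target and return the result."""
    batch = [
        CdcRecord("I", 1, 1, "apple"),
        CdcRecord("I", 2, 2, "pear"),
        CdcRecord("U", 3, 1, "green apple"),
        CdcRecord("D", 4, 2, "pear"),
    ]
    return merge({}, latest_per_key(new_records(batch, None)), soft_delete)
```

With soft_delete=True, run_demo keeps product 2 in the target with is_deleted set to True. With soft_delete=False the row is dropped. On a later run, passing the highest updated_at value already loaded as the watermark would skip records that were processed before, which is the role the incremental filter plays in the summarized model.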