Apache Iceberg DML (update/delete/merge) & Maintenance in Trino
Blog post from Starburst
Apache Iceberg, combined with Trino, brings database-style updates, deletes, and merges to data lakehouse tables stored in immutable object storage such as Amazon S3. Row-level modification of this kind was difficult in the Hadoop era. Iceberg's DML (Data Manipulation Language) support records every change in the table's transaction log, so subsequent queries see the committed changes. Inserts are the most common operation; updates and deletes preserve read integrity through snapshots, so a running query keeps reading a consistent version of the table. Merges apply conditional logic to update and insert rows in a single statement, which enables patterns such as Slowly Changing Dimensions (SCD Type 2) and Change Data Capture (CDC). Periodically optimizing Iceberg tables is recommended: consolidating small files and cleaning up metadata keeps query execution efficient, especially on actively written tables. Together, these features expand the use cases and reliability of data lakehouses across cloud platforms.
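
To make the merge and maintenance operations above concrete, here is a minimal sketch in Python that only builds the corresponding Trino SQL as strings. The table names (lake.sales.orders, lake.staging.order_changes), the columns (id, status, updated_at), and the threshold values are hypothetical and chosen for illustration. The statement forms follow the MERGE syntax and the optimize, expire_snapshots, and remove_orphan_files table procedures of Trino's Iceberg connector; check them against the documentation for the Trino version in use before running anything.

```python
"""Sketch: Trino SQL for an Iceberg upsert and routine table maintenance.

All table and column names below are hypothetical. The statements are only
built as strings; executing them requires a Trino client of your choice.
"""


def merge_statement(target: str, source: str) -> str:
    """Build an upsert: update matching rows, insert the rest.

    Assumes both tables share the hypothetical columns id, status, updated_at.
    """
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        "ON t.id = s.id\n"
        "WHEN MATCHED THEN UPDATE SET status = s.status, updated_at = s.updated_at\n"
        "WHEN NOT MATCHED THEN INSERT (id, status, updated_at)\n"
        "    VALUES (s.id, s.status, s.updated_at)"
    )


def maintenance_statements(
    table: str,
    file_size_threshold: str = "128MB",  # illustrative value, not a recommendation
    retention: str = "7d",               # illustrative value, not a recommendation
) -> list[str]:
    """Build the periodic maintenance statements for one Iceberg table."""
    return [
        # Rewrite files smaller than the threshold into larger ones.
        f"ALTER TABLE {table} EXECUTE optimize(file_size_threshold => '{file_size_threshold}')",
        # Drop snapshots older than the retention window from table metadata.
        f"ALTER TABLE {table} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Delete files that no remaining snapshot references.
        f"ALTER TABLE {table} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
    ]


if __name__ == "__main__":
    print(merge_statement("lake.sales.orders", "lake.staging.order_changes"))
    for statement in maintenance_statements("lake.sales.orders"):
        print(statement)
```

Keeping the statements as plain strings leaves the choice of client and scheduler open; in practice the maintenance statements would typically run on a schedule against tables that receive frequent writes.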