Iceberg Snapshots Affect Storage, Not Performance
Blog post from Starburst
Apache Iceberg uses snapshots for version control, which enables features such as time-travel querying and rollback. Snapshots do not affect query performance, because a query reads only the snapshot it targets, which by default is the current one. They do increase the storage footprint of the data lake: every data file referenced by any retained snapshot must be kept, even when the current snapshot no longer uses it. Data lake providers allow extensive storage but charge for it, so strategies such as expiring older snapshots are needed to keep costs under control. Iceberg's Merge-on-Read strategy records updates and deletes in separate delete files instead of rewriting existing data files, so each change adds little storage beyond what is already in use. Compaction, by contrast, writes new data files and temporarily increases storage until the snapshots that reference the old files are expired. Regular maintenance, including snapshot expiration and orphan file cleanup, is therefore recommended to reclaim storage while keeping as much history as is actually needed.
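The storage arithmetic can be illustrated with a toy model in Python. This is only a sketch and does not use Iceberg's API: the `Snapshot` class, the two helper functions, and the file names and sizes are all hypothetical. It assumes a table's footprint is the total size of every file referenced by at least one retained snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a table snapshot: an id plus the
    # names of the data files that snapshot references.
    snapshot_id: int
    files: frozenset


def retained_bytes(snapshots, file_sizes):
    """Total size of every file referenced by at least one snapshot."""
    live = set().union(*(s.files for s in snapshots))
    return sum(file_sizes[name] for name in live)


def keep_newest(snapshots, keep_last):
    """Return the newest `keep_last` snapshots (input is ordered oldest first)."""
    if keep_last < 1:
        raise ValueError("at least one snapshot must be kept")
    return snapshots[-keep_last:]


# Illustrative sizes in arbitrary units (assumed, not measured).
file_sizes = {
    "a.parquet": 100,
    "b.parquet": 100,
    "c.parquet": 100,
    "ab_compacted.parquet": 200,
}

history = [
    Snapshot(1, frozenset({"a.parquet"})),
    Snapshot(2, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(3, frozenset({"a.parquet", "b.parquet", "c.parquet"})),
    # Compaction rewrites a and b into one file; the old files stay on
    # storage for as long as snapshots 1 to 3 are retained.
    Snapshot(4, frozenset({"ab_compacted.parquet", "c.parquet"})),
]

before = retained_bytes(history, file_sizes)
after = retained_bytes(keep_newest(history, 1), file_sizes)
reclaimed = before - after
```

In this example the full history retains 500 units, while the current snapshot alone needs 300, so expiring the three older snapshots would free 200 units. A query against the current snapshot touches the same two files in both cases, which is why the extra history costs storage rather than query time.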