The file explosion problem in Apache Iceberg and what to do when it happens to you
Blog post from Starburst
Apache Iceberg is an open table format that manages metadata for datasets stored in file formats like Parquet, ORC, and Avro, enabling features such as schema evolution, time travel, and concurrent data access from engines like Trino, Starburst, and Spark. To preserve transactional integrity and historical snapshots, Iceberg writes new metadata files for every data modification. Frequent small writes therefore produce a proliferation of small metadata and data files, and as the file count grows, query planning and scan performance degrade and the system scales poorly. This is commonly referred to as the "file explosion problem."

Two maintenance operations mitigate the problem: expiring old snapshots, which removes files no longer referenced by any retained snapshot, and file compaction, which merges many small data files into fewer, larger ones without losing data integrity or the ability to time travel to the snapshots that are kept. Compaction is the key strategy, but it competes with user queries for resources, so it must be prioritized deliberately: target high-priority tables first, and consider running compaction on a separate compute cluster so that query performance stays predictable.
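For concreteness, here is a minimal sketch of how these two maintenance operations can be run from PySpark using Iceberg's built-in Spark procedures, `expire_snapshots` and `rewrite_data_files`. The catalog name (`my_catalog`), the table name (`db.events`), the seven-day retention window, and the `min-input-files` option are illustrative assumptions, not values from this post; the exact catalog configuration depends on your environment.

```python
# Minimal maintenance sketch for an Iceberg table from PySpark.
# Assumes a Spark session already configured with an Iceberg catalog named
# "my_catalog"; the table name and retention window are placeholders.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# 1) Expire snapshots older than seven days. This drops metadata files (and
#    any data files) no longer referenced by a retained snapshot, while
#    keeping time travel available for the snapshots that remain.
cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{cutoff}'
    )
""")

# 2) Compact small data files into larger ones with the bin-pack strategy.
#    'min-input-files' avoids rewriting partitions that already contain only
#    a handful of files.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('min-input-files', '5')
    )
""")
```

Because both procedures read and rewrite table files, they compete with interactive queries for CPU and I/O; in practice they are typically scheduled on a separate cluster and run first against the highest-priority tables, as described above. Trino and Starburst expose comparable table procedures (for example, `optimize` and `expire_snapshots` invoked via `ALTER TABLE ... EXECUTE`), so the same maintenance can be driven from the query engine instead of Spark.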