The file explosion problem in Apache Iceberg and what to do when it happens to you
Blog post from Starburst
Apache Iceberg is an open table format that manages metadata for datasets stored in file formats like Parquet, ORC, and Avro, enabling features such as schema evolution, time travel, and concurrent data access from engines like Trino, Starburst, and Spark. To preserve transactional integrity and historical snapshots, Iceberg writes new metadata files for every data modification. Frequent small writes therefore produce a proliferation of small metadata and data files, and as the file count grows, query planning and scan performance degrade and the system scales poorly. This is commonly referred to as the "file explosion problem."

Two maintenance operations mitigate the problem: expiring old snapshots, which removes files no longer referenced by any retained snapshot, and file compaction, which merges many small data files into fewer, larger ones without losing data integrity or the ability to time travel to the snapshots that are kept. Compaction is the key strategy, but it competes with user queries for resources, so it must be prioritized deliberately: target high-priority tables first, and consider running compaction on a separate compute cluster so that query performance stays predictable.
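For concreteness, here is a minimal sketch of how these two maintenance operations can be run from PySpark using Iceberg's built-in Spark procedures, `expire_snapshots` and `rewrite_data_files`. The catalog name (`my_catalog`), the table name (`db.events`), the seven-day retention window, and the `min-input-files` option are illustrative assumptions, not values from this post; the exact catalog configuration depends on your environment.

```python
# Minimal maintenance sketch for an Iceberg table from PySpark.
# Assumes a Spark session already configured with an Iceberg catalog named
# "my_catalog"; the table name and retention window are placeholders.
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# 1) Expire snapshots older than seven days. This drops metadata files (and
#    any data files) no longer referenced by a retained snapshot, while
#    keeping time travel available for the snapshots that remain.
cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{cutoff}'
    )
""")

# 2) Compact small data files into larger ones with the bin-pack strategy.
#    'min-input-files' avoids rewriting partitions that already contain only
#    a handful of files.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('min-input-files', '5')
    )
""")
```

Because both procedures read and rewrite table files, they compete with interactive queries for CPU and I/O; in practice they are typically scheduled on a separate cluster and run first against the highest-priority tables, as described above. Trino and Starburst expose comparable table procedures (for example, `optimize` and `expire_snapshots` invoked via `ALTER TABLE ... EXECUTE`), so the same maintenance can be driven from the query engine instead of Spark.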