Parquet and ORC’s many shortfalls for machine learning (ML) workloads, and what should be done about them
Blog post from Starburst
Columnar storage formats like Apache Parquet and ORC revolutionized data management by delivering large performance gains for traditional SQL workloads, but they run into real trouble when applied to modern machine learning (ML) workloads. These formats were designed for large-scale scanning, joining, and aggregation, and they struggle with characteristics that are now routine in ML datasets: very wide and sparse tables, high-dimensional vector columns, and deletion requirements driven by privacy regulations such as GDPR and CCPA.

Three pain points stand out. First, for wide tables, parsing the file footer's metadata can dominate query execution time, since the footer carries statistics for every column in every row group. Second, the general-purpose block compression these formats rely on is not designed for dense float vectors, which are close to incompressible, so storing embeddings yields little benefit over raw bytes. Third, block-based compression makes the in-place deletes needed for compliance impractical: removing even a single user's rows typically forces a costly rewrite of the entire file.

Parquet and ORC remain valuable, but these limitations point to a need for new columnar formats designed specifically for ML workloads, with optimizations such as direct (lazily parsed) metadata access and first-class support for in-place deletes. The sketches below illustrate each of the three pain points in turn.
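To make the first point concrete, here is a minimal sketch, assuming pyarrow and a local file (the column and row counts are arbitrary), that writes a wide table and then times footer parsing against a single-column read:

```python
# Sketch: footer metadata parsing vs. data read for a wide Parquet table.
import time

import pyarrow as pa
import pyarrow.parquet as pq

NUM_COLS = 10_000  # "wide" table, e.g. one column per ML feature
NUM_ROWS = 10

# Build a wide table of trivial float columns and write it once.
table = pa.table({f"f{i}": [0.0] * NUM_ROWS for i in range(NUM_COLS)})
pq.write_table(table, "wide.parquet")

# Opening the file deserializes the whole footer, which carries
# per-column, per-row-group metadata and statistics.
start = time.perf_counter()
meta = pq.ParquetFile("wide.parquet").metadata
footer_ms = (time.perf_counter() - start) * 1000

# Reading a single column afterwards is comparatively cheap, so for
# narrow point lookups the metadata cost can dominate.
start = time.perf_counter()
pq.ParquetFile("wide.parquet").read(columns=["f0"])
read_ms = (time.perf_counter() - start) * 1000

print(f"{meta.num_columns} columns: footer parse {footer_ms:.1f} ms, "
      f"one-column read {read_ms:.1f} ms")
```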
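For the second point, a small experiment, assuming pyarrow and numpy (the 768-dimensional random vectors stand in for real model embeddings), shows how little a general-purpose codec like zstd buys on dense float data:

```python
# Sketch: block compression on float32 embedding vectors.
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 768)).astype(np.float32)

# Store each embedding as a fixed-size float32 list column.
arr = pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), 768)
table = pa.table({"embedding": arr})

pq.write_table(table, "vectors.parquet", compression="zstd")

raw_bytes = vectors.nbytes
file_bytes = os.path.getsize("vectors.parquet")
print(f"raw: {raw_bytes/1e6:.1f} MB, zstd parquet: {file_bytes/1e6:.1f} MB "
      f"(ratio {raw_bytes/file_bytes:.2f}x)")
```

On random floats the ratio comes out near 1x, and real embeddings typically fare only slightly better, which is why vector-specific encodings are a natural target for a new format.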
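And for the third point, here is a sketch, again assuming pyarrow (the toy event log and its user_id column are hypothetical), of what a single GDPR erasure request costs when a format has no in-place delete:

```python
# Sketch: "deleting" one user's rows from a Parquet file means
# decompressing, filtering, re-encoding, and rewriting the whole file.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Toy event log; user 42 files an erasure request.
pq.write_table(
    pa.table({"user_id": [7, 42, 7, 42, 99], "event": list("abcde")}),
    "events.parquet",
)

table = pq.read_table("events.parquet")  # decompress everything
kept = table.filter(pc.not_equal(table["user_id"], 42))
pq.write_table(kept, "events.parquet")   # re-encode and rewrite everything
```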