Parquet is a column-oriented storage format that offers strong compression relative to row-based interchange formats. It relies on three principal concepts: row groups, column chunks, and pages. Understanding these concepts allows users to make informed decisions when writing files, decisions that directly impact the level of compression and subsequent read performance. The format also embeds metadata in its footer, including offsets to row groups, column chunks, and pages, as well as statistics that query engines can exploit to skip column chunks that cannot match a query's conditions.

Parquet supports a range of encoding techniques, including dictionary encoding, run-length encoding (RLE), and delta encoding, which can be applied to different data types. The encoding used for a column currently cannot be controlled through ClickHouse settings, although planned improvements aim to address this limitation.

ClickHouse reads Parquet files through the file table function, and recent developments have parallelized reads within a single file by assigning each row group to a thread responsible for reading and decoding it. This significantly improves performance on large files.

The number of row groups in a file therefore matters for performance: too few row groups leave available CPU cores idle, while too many lead to high memory consumption. Writers need to strike a balance between parallel decoding and efficient reading.

ClickHouse can also query multiple files in a directory, but this approach has limitations: it requires file listing operations and offers little support for schema evolution or write consistency. Modern table formats such as Apache Iceberg aim to address these challenges by providing SQL-table-like functionality over files in a data lake.

ClickHouse's Parquet support continues to evolve, with planned improvements including exploiting metadata for conditions in any WHERE clause and better logical type support. The sketches below illustrate the main points with hypothetical examples.
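To make the row group, column chunk, and metadata concepts concrete, recent ClickHouse versions include a ParquetMetadata input format that exposes a file's footer metadata. A minimal sketch, assuming such a version; the file name is a placeholder:

```sql
-- Inspect the footer metadata of a Parquet file (file name is illustrative).
-- num_row_groups is particularly relevant: it bounds how much read
-- parallelism a single file can offer.
SELECT num_columns, num_rows, num_row_groups, format_version
FROM file('example.parquet', ParquetMetadata);
```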
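Reading a single file with the file table function is a one-liner; the file, column names, and aggregate below are hypothetical. ClickHouse infers the schema from the file itself, and max_threads caps how many row groups are decoded concurrently:

```sql
-- Query a Parquet file directly; the schema is inferred from the file.
-- With row-group-level parallelism, up to max_threads row groups are
-- read and decoded at the same time.
SELECT town, avg(price) AS avg_price
FROM file('house_prices.parquet', Parquet)
GROUP BY town
ORDER BY avg_price DESC
LIMIT 10
SETTINGS max_threads = 16;
```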
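On the write side, the row group count can be influenced indirectly by setting the number of rows per row group. A sketch, assuming the output_format_parquet_row_group_size setting available in recent ClickHouse versions; the source table name is a placeholder:

```sql
-- Write Parquet with smaller row groups (100k rows instead of the default)
-- so that readers can parallelize across more groups. Too small a value
-- hurts compression and adds per-group metadata overhead.
INSERT INTO FUNCTION file('out.parquet', Parquet)
SELECT *
FROM uk_price_paid
SETTINGS output_format_parquet_row_group_size = 100000;
```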
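Querying a directory of files uses glob patterns, which is where the file listing requirement mentioned above comes from. The directory and pattern here are illustrative:

```sql
-- Query every Parquet file matching a glob pattern. ClickHouse must first
-- list the matching files; the _file virtual column identifies which file
-- each row came from.
SELECT _file, count() AS rows
FROM file('data/*.parquet', Parquet)
GROUP BY _file;
```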
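For comparison, recent ClickHouse versions also ship an iceberg table function. The sketch below assumes that function; the URL and credentials are placeholders, and the exact signature may vary by version:

```sql
-- Read an Apache Iceberg table stored on S3. The Parquet data files to
-- scan are resolved from the table's own metadata layer rather than from
-- a directory listing, and schema evolution is handled by the format.
SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/warehouse/my_table',
             'ACCESS_KEY', 'SECRET_KEY');
```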