Building high-performance full-text search for object storage
Blog post from ClickHouse
The ClickHouse engineering team has redesigned its full-text index to perform well on object storage, favoring sequential access patterns to offset the latency inherent in remote storage. With the redesign, queries can be executed directly from the index, which reduces how often the full dataset has to be read. The text index has three main components: a dictionary file that stores the indexed tokens, a sparse dictionary index for fast lookups, and posting lists that map each token to the positions of the rows containing it. Block-based layouts, front-coded compression, and Roaring bitmaps keep storage and retrieval efficient even on very large datasets. The index also supports complex queries through direct read modes and optimized execution paths that avoid unnecessary I/O. The design fits ClickHouse Cloud's architecture, in which processing is distributed across multiple nodes that share object storage, so full-text search can scale with the cluster.
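To make the three components more concrete, here is a minimal in-memory sketch of how a block-based, front-coded dictionary, a sparse index over it, and posting lists could fit together. It is an illustration under stated assumptions, not ClickHouse's implementation: the function names, the `BLOCK_SIZE` value, and the whitespace tokenizer are all hypothetical, and plain sorted Python lists stand in for Roaring bitmaps.

```python
from bisect import bisect_right

BLOCK_SIZE = 4  # hypothetical: number of tokens per dictionary block


def front_encode(tokens):
    """Front-code a sorted block as (shared_prefix_length, suffix) pairs."""
    encoded, prev = [], ""
    for tok in tokens:
        shared = 0
        while shared < min(len(prev), len(tok)) and prev[shared] == tok[shared]:
            shared += 1
        encoded.append((shared, tok[shared:]))
        prev = tok
    return encoded


def front_decode(encoded):
    """Rebuild the tokens of one block with a single sequential scan."""
    tokens, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        tokens.append(prev)
    return tokens


def build_index(rows):
    """Return (sparse_index, blocks, postings) for a list of text rows."""
    postings = {}  # token -> ascending row ids (stand-in for a Roaring bitmap)
    for row_id, text in enumerate(rows):
        for tok in set(text.lower().split()):  # assumed whitespace tokenizer
            postings.setdefault(tok, []).append(row_id)
    tokens = sorted(postings)
    chunks = [tokens[i:i + BLOCK_SIZE] for i in range(0, len(tokens), BLOCK_SIZE)]
    sparse_index = [chunk[0] for chunk in chunks]  # first token of each block
    blocks = [front_encode(chunk) for chunk in chunks]
    return sparse_index, blocks, postings


def lookup(index, token):
    """Find the rows containing `token`, decoding at most one dictionary block."""
    sparse_index, blocks, postings = index
    pos = bisect_right(sparse_index, token) - 1  # last block starting <= token
    if pos < 0:
        return []  # token sorts before every indexed token
    if token in front_decode(blocks[pos]):
        return postings[token]
    return []
```

In this sketch, `lookup(build_index(rows), "error")` binary-searches the small sparse index, decodes a single dictionary block, and returns the matching row ids without reading the rows themselves. That mirrors, in miniature, the idea of answering a query from the index alone with one contiguous read per block.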