Sparse versus dense document values with Apache Lucene

Post Details

Company

Elastic

Date Published

Nov. 22, 2016

Author

Michael McCandless

Word Count

1,179

Language

-

Hacker News Points

-

Source URL

www.elastic.co/blog/sparse-versus-dense-document-values-with-apache-lucene

Summary

Apache Lucene has changed how document values (doc values) are indexed and accessed, with the goal that users pay only for what they use. The changes, slated for the upcoming Lucene 7.0 release, replace the random-access API with a more restrictive iterator API, which allows better compression and optimization, especially for sparse fields, where only some documents have a value. Further improvements include a new codec design that removes abstraction layers and implements the sparse case directly, and a faster advanceExact API for positioning on a specific document. Together these changes improve search performance and reduce index size, as shown by new benchmarks built on the New York City taxi ride data corpus. The benchmarks, which cover both sparse and dense cases, show significant gains despite initial setbacks during the API transition. Index-time sorting, although it slows indexing, yields notable search speedups, a worthwhile trade-off for many users. The work also underscores the value of automated benchmarks for detecting performance regressions and guiding optimizations.
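
To make the iterator idea concrete, here is a minimal Python sketch of a forward-only reader over a sparse field. It is a toy model, not Lucene's code: the class `SparseDocValues`, its storage layout, and the sample fare values are all assumptions made for illustration. Only the method name `advance_exact` is borrowed from the advanceExact API mentioned in the summary. The sketch stores only the documents that have a value and answers "does this document have a value?" as the caller moves forward through doc IDs.

```python
from bisect import bisect_left


class SparseDocValues:
    """Toy forward-only reader for a sparse numeric field (hypothetical, not Lucene's API)."""

    def __init__(self, values_by_doc):
        # Keep only documents that actually have a value, sorted by doc ID.
        self._docs = sorted(values_by_doc)
        self._values = [values_by_doc[doc] for doc in self._docs]
        self._pos = 0            # search never restarts before this index
        self._last_target = -1   # enforces forward-only access
        self._has_value = False

    def advance_exact(self, target):
        """Move to doc `target`; return True if that doc has a value."""
        if target < self._last_target:
            raise ValueError("iterator is forward-only; targets must not decrease")
        self._last_target = target
        # Binary search only in the part of the list not yet passed.
        self._pos = bisect_left(self._docs, target, lo=self._pos)
        self._has_value = (
            self._pos < len(self._docs) and self._docs[self._pos] == target
        )
        return self._has_value

    def value(self):
        """Return the value for the doc most recently found by advance_exact."""
        if not self._has_value:
            raise LookupError("current document has no value")
        return self._values[self._pos]


# Usage: sum a field over matching docs, skipping docs without a value.
fares = SparseDocValues({2: 7.5, 5: 12.0, 9: 3.25})  # made-up sample data
hits = [1, 2, 5, 8, 9]                                # doc IDs in increasing order
total = sum(fares.value() for doc in hits if fares.advance_exact(doc))
```

Because callers can only move forward, the reader never needs a slot for every document, so storage scales with the number of documents that have a value rather than with the total document count. That is the "pay only for what you use" property the summary describes, shown here in simplified form.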