Company
Date Published
Author
Michael McCandless
Word count
1367
Language
-
Hacker News points
None

Summary

Apache Lucene handles deleted documents by marking them in a per-segment bitset; the disk space they occupy is reclaimed, and term statistics are updated, only when segments are merged. This avoids the high cost of immediately updating the index's data structures and statistics, but until a merge happens deleted documents continue to occupy disk space and RAM. The default merge policy, TieredMergePolicy, favors merging segments with more deletions in order to reclaim space, though overly aggressive settings can lead to inefficient merges. Search performance is also affected, because deleted documents must still be skipped at search time, although the slowdown is generally smaller than the percentage of deleted documents. The optimize API in Elasticsearch (renamed force merge in later versions) can forcibly reclaim the space, but it is costly, and reclamation is better left to natural merging unless necessary. For time-sensitive applications, time-based indices can handle deletions more efficiently than letting Lucene manage them. Overall, it's advisable to rely on Lucene's default settings for deletions and merges, as they generally balance performance and resource use effectively.
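
The mark-then-merge behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Segment` class, its methods, and the `merge` function are hypothetical names invented for illustration and are not Lucene's actual API or on-disk format. It shows that a delete only flips a bit, that searches must still skip the deleted document, and that term statistics stay stale until a merge rewrites the segment.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy write-once segment (hypothetical, not Lucene's real structure)."""
    docs: list                      # each document is a list of terms
    live: list = field(init=False)  # live[i] becomes False once doc i is deleted

    def __post_init__(self):
        self.live = [True] * len(self.docs)

    def delete(self, doc_id):
        # Deleting only flips a bit; the document's data stays in place.
        self.live[doc_id] = False

    def search(self, term):
        # Deleted documents are still visited and skipped one by one.
        return [i for i, doc in enumerate(self.docs)
                if term in doc and self.live[i]]

    def doc_freq(self, term):
        # Statistics are not updated on delete, so deleted docs still count.
        return sum(1 for doc in self.docs if term in doc)

    def deleted_ratio(self):
        # Fraction of documents in this segment that are marked deleted.
        return self.live.count(False) / len(self.docs) if self.docs else 0.0


def merge(segments):
    """Write a new segment holding only live documents."""
    return Segment([doc for seg in segments
                    for doc, alive in zip(seg.docs, seg.live) if alive])


seg = Segment([["lucene", "merge"], ["lucene", "delete"], ["search"]])
seg.delete(1)
hits_before = seg.search("lucene")    # doc 1 is skipped at search time
df_before = seg.doc_freq("lucene")    # still includes the deleted doc
merged = merge([seg])
df_after = merged.doc_freq("lucene")  # corrected once the merge has run
```

In this toy model, a merge policy that favors segments with a high `deleted_ratio()` reclaims the most space per merge, which mirrors the preference the summary attributes to TieredMergePolicy.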