Speeding up SimHash by 10x using a bit hack

Company

Dynatrace

Date Published

Oct. 9, 2023

Author

Otmar Ertl

Word count

1388

Language

American English

Hacker News points

None

URL

www.dynatrace.com/news/blog/speeding-up-simhash-by-10x

Summary

The SimHash algorithm, introduced by Moses Charikar in 2002, generates compact fingerprints of sets such that similar sets produce similar fingerprints. The cosine similarity between two sets can then be estimated quickly by counting how many bits of their fingerprints agree. The algorithm applies to any objects that can be represented as sets, such as text, graphs, or GPS routes, and is used in practical applications like website deduplication and web tracking. SimHash computes a fingerprint by combining the hash values of the set elements: each bit of the fingerprint reflects the majority state (set or unset) of the corresponding bit across those hash values. FastSimHash, a more efficient variant, needs fewer processing steps because it counts multiple bits simultaneously, which yields significant performance improvements, especially for larger sets and fingerprints. Benchmarks show FastSimHash running up to ten times faster than traditional SimHash, owing to its optimized counting method and smaller data structures, which improve data locality. The open-source Hash4j library provides implementations of both SimHash and FastSimHash, allowing users to explore these techniques and their applications in set similarity measurements.
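
The majority-bit rule and the idea of counting several bits at once can be illustrated with a minimal Python sketch. It is not the Hash4j API and not necessarily the exact scheme from the article. All function names are hypothetical. The BLAKE2b-based 64-bit hash, the 8-bit counter lanes, and the choice to leave a bit unset on a tie are assumptions made for illustration. The packed variant adds eight one-byte counters with a single integer addition and flushes them into full-width counters before any lane can overflow.

```python
import hashlib

BITS = 64
LANE_MASK = 0x0101010101010101  # lowest bit of each of the 8 byte lanes


def hash64(element: str) -> int:
    """Stand-in 64-bit hash (assumption: any well-mixed 64-bit hash would do)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def _majority(counts, n):
    """Set bit i if more than half of the n hashes had bit i set (ties stay unset)."""
    fingerprint = 0
    for i, c in enumerate(counts):
        if 2 * c > n:
            fingerprint |= 1 << i
    return fingerprint


def simhash_naive(elements) -> int:
    """Classic approach: one counter update per bit and per element."""
    counts = [0] * BITS
    n = 0
    for e in elements:
        h = hash64(e)
        n += 1
        for i in range(BITS):
            counts[i] += (h >> i) & 1
    return _majority(counts, n)


def _flush(packed, counts):
    """Move the one-byte lane counters into the full-width counters."""
    for j in range(8):
        word = packed[j]
        for lane in range(8):
            counts[j + 8 * lane] += (word >> (8 * lane)) & 0xFF
        packed[j] = 0


def simhash_packed(elements) -> int:
    """Bit-hack approach: each addition updates 8 counters at once."""
    counts = [0] * BITS
    packed = [0] * 8  # packed[j] counts bits j, j+8, ..., j+56 in separate byte lanes
    pending = 0
    n = 0
    for e in elements:
        h = hash64(e)
        n += 1
        for j in range(8):
            packed[j] += (h >> j) & LANE_MASK
        pending += 1
        if pending == 255:  # a byte lane holds at most 255, so flush now
            _flush(packed, counts)
            pending = 0
    _flush(packed, counts)
    return _majority(counts, n)


def bit_agreement(a: int, b: int) -> float:
    """Fraction of the 64 fingerprint bits on which a and b agree."""
    return 1.0 - bin(a ^ b).count("1") / BITS
```

Both functions should return the same fingerprint for the same input; the packed version simply needs 8 additions per element instead of 64 counter updates. Two fingerprints can then be compared with `bit_agreement`, where a higher value suggests more similar sets. A Python sketch like this shows the logic only and says nothing about the speedups reported for the Java implementation.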