How to build fast similarity search for email from the ground up

Blog post from Sublime Security

Post Details
Author: Ross Wolf
Word Count: 2,932
Language: English
Summary

Sublime's platform groups similar email messages to improve the analyst experience and make the system more efficient. Each message is represented as a set of smaller fragments, produced by tokenization and shingling. Similarity between two messages is defined as the Jaccard index of their sets, and MinHash is used to approximate that set similarity efficiently. A banding technique then narrows the search space to messages that are likely to be highly similar, so the platform can process large volumes of messages and identify highly similar ones in milliseconds. This fast retrieval and clustering supports real-time grouping and enables features such as cascading remediation and live clustering in high-scale environments.
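To make the pipeline in the summary concrete, here is a minimal Python sketch of the four steps: shingling, the exact Jaccard index, a MinHash signature, and a banded index for candidate lookup. It is an illustration under assumptions, not Sublime's implementation: the function and class names, the 3-word shingles, the 128-value signature, the 32 bands, and the use of seeded BLAKE2b as the hash family are all choices made for this example.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 128  # signature length (assumed for this sketch)
BANDS = 32        # number of bands (assumed); rows per band = NUM_HASHES // BANDS


def shingles(text: str, k: int = 3) -> set[str]:
    """Lowercase, split into word tokens, and return the set of k-word shingles."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard index: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of item; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(items: set[str], num_hashes: int = NUM_HASHES) -> tuple[int, ...]:
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return tuple(
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    )


def estimate_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """Fraction of matching signature positions approximates the Jaccard index."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(sig: tuple[int, ...], bands: int = BANDS) -> list[tuple]:
    """Split a signature into bands; each (band index, rows) pair is a bucket key."""
    rows = len(sig) // bands
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]


class BandIndex:
    """Maps band keys to message IDs so lookups only touch likely matches."""

    def __init__(self) -> None:
        self.buckets: dict[tuple, set[str]] = defaultdict(set)

    def add(self, msg_id: str, sig: tuple[int, ...]) -> None:
        for key in band_keys(sig):
            self.buckets[key].add(msg_id)

    def candidates(self, sig: tuple[int, ...]) -> set[str]:
        """IDs of messages that share at least one full band with sig."""
        found: set[str] = set()
        for key in band_keys(sig):
            found |= self.buckets.get(key, set())
        return found


if __name__ == "__main__":
    original = "Your invoice is attached, please review and pay by Friday"
    variant = "Your invoice is attached, please review and pay by Monday"
    index = BandIndex()
    index.add("msg-1", minhash(shingles(original)))
    query = minhash(shingles(variant))
    print("exact:", jaccard(shingles(original), shingles(variant)))
    print("estimate:", estimate_jaccard(minhash(shingles(original)), query))
    print("candidates:", index.candidates(query))
```

Two messages become candidates for each other only if at least one band of their signatures matches exactly. With a fixed signature length, using more bands with fewer rows each surfaces pairs at lower similarity but returns more candidates to verify, while fewer, wider bands restrict results to near-identical messages.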