How to build fast similarity search for email from the ground up
Blog post from Sublime Security
Sublime's platform groups similar email messages to improve the analyst experience and reduce redundant processing. Each message is represented as a set of small fragments, produced by tokenizing the text and then shingling it into overlapping runs of tokens. Similarity between two messages is measured with the Jaccard index, the size of the intersection of their sets divided by the size of their union. Because comparing full sets across large volumes of mail is expensive, Sublime uses MinHash to compress each set into a short signature whose agreement rate approximates the Jaccard index. A banding technique then splits each signature into bands and compares in detail only those messages that match on at least one band, which shrinks the search space to likely high-similarity candidates. Together, these steps let the platform find highly similar messages in milliseconds and cluster them in real time, which supports features such as cascading remediation and live clustering in high-scale environments.
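To make the pipeline concrete, here is a minimal Python sketch of the steps described above: shingling, exact Jaccard similarity, MinHash signatures, and a banded index for candidate lookup. It is an illustration under stated assumptions, not Sublime's implementation. The function and class names, the whitespace tokenizer, the shingle size of 3, the 128 hash functions, the 32 bands, and the use of seeded BLAKE2b as the hash family are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Tokenize on whitespace and return the set of k-token shingles."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """One member of a seeded 64-bit hash family (assumed: BLAKE2b)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def band_keys(signature, bands=32):
    """Split a signature into equal bands; each band becomes a bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


class LSHIndex:
    """Banded index: messages sharing any full band land in the same bucket."""

    def __init__(self, bands=32):
        self.bands = bands
        self.buckets = defaultdict(set)

    def add(self, msg_id, signature):
        for key in band_keys(signature, self.bands):
            self.buckets[key].add(msg_id)

    def candidates(self, signature):
        """Return ids of messages that match the query on at least one band."""
        found = set()
        for key in band_keys(signature, self.bands):
            found |= self.buckets.get(key, set())
        return found


# Usage: index two messages, then look up candidates for a near-duplicate.
invoice = "please review the attached invoice and confirm payment by friday"
lunch = "team lunch is moved to noon tomorrow in the main kitchen"
query = "please review the attached invoice and confirm payment by monday"

index = LSHIndex(bands=32)
index.add("invoice", minhash_signature(shingles(invoice)))
index.add("lunch", minhash_signature(shingles(lunch)))

query_sig = minhash_signature(shingles(query))
print(index.candidates(query_sig))
# Exact Jaccard here is 7/9: 7 shared shingles out of 9 distinct ones.
print(jaccard(shingles(invoice), shingles(query)))
```

The band and row counts set the similarity threshold. With b bands of r rows each, two messages with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. For the 32 bands of 4 rows assumed here, a pair at s = 0.8 is almost always retrieved, while a pair at s = 0.2 is retrieved only about 5% of the time, so most dissimilar messages are never compared in detail.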