Text Analysis for Hybrid Search: Tokenization, Stopwords & Accent Folding
Blog post from Weaviate
Hybrid search in vector databases combines vector similarity, which captures semantic meaning, with BM25, which matches exact tokens. Because BM25 can only score the tokens the tokenizer emits, tokenization largely determines how well the keyword half of a hybrid query works. Poor tokenization causes search failures, especially in multilingual data, when it does not account for accents or for languages that are not delimited by whitespace. Weaviate v1.37 makes its tokenizer observable and adaptable, with per-property configuration to address these issues. It adds support for accent folding, per-language stopwords, and language-specific tokenizers for non-Latin scripts, so that keyword matching holds up across languages and data types. The update also introduces REST endpoints for testing and verifying tokenization configurations without reindexing, which shortens the loop for tuning search behavior.
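To make the effect of accent folding and per-language stopwords concrete, here is a minimal sketch in plain Python. It is not Weaviate's tokenizer or API: the `analyze` and `fold_accents` functions, the `STOPWORDS` table, and the word-splitting rule are all assumptions chosen for illustration, and they only approximate the kind of analysis the post describes.

```python
import re
import unicodedata

# Hypothetical per-language stopword lists, for illustration only.
# Entries are stored already folded, because filtering runs after folding.
STOPWORDS = {
    "en": {"the", "a", "of", "in"},
    "fr": {"le", "la", "de", "du"},
}


def fold_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, language: str = "en", fold: bool = True) -> list[str]:
    """Lowercase, optionally fold accents, split on word characters,
    then drop the stopwords configured for the given language."""
    normalized = text.lower()
    if fold:
        normalized = fold_accents(normalized)
    tokens = re.findall(r"\w+", normalized)
    stop = STOPWORDS.get(language, set())
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    # With folding, the indexed text and an unaccented query share tokens.
    print(analyze("Le Café de Paris", language="fr"))              # ['cafe', 'paris']
    print(analyze("cafe paris", language="fr"))                    # ['cafe', 'paris']
    # Without folding, 'café' and 'cafe' are different tokens for exact matching.
    print(analyze("Le Café de Paris", language="fr", fold=False))  # ['café', 'paris']
```

The word-character split used here says nothing useful about scripts written without spaces: a run of Chinese or Japanese characters comes back as a single token, which is the gap that language-specific tokenizers are meant to fill.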