Text Analysis for Hybrid Search: Tokenization, Stopwords & Accent Folding

Post Details

Company

Weaviate

Date Published

May 14, 2026

Author

André Mourão, Ivan Despot

Word Count

2,672

Company Posts That Month

4

Language

English

Hacker News Points

-

Post removed?

No

Source URL

weaviate.io/blog/tokenization-text-analysis-weaviate

Summary

Hybrid search in vector databases combines vector similarity, which captures semantic meaning, with BM25, which matches exact tokens. Tokenization determines how well the BM25 component works: if a query and the indexed text are not split and normalized into the same tokens, BM25 cannot match them. Such failures are most common in multilingual data, where accents and languages that are not whitespace-delimited need language-specific handling. Weaviate v1.37 makes its tokenizer observable and configurable per property, adding support for accent folding, per-language stopwords, and language-specific tokenizers for non-Latin scripts, with the aim of more robust and accurate results across languages and data types. The update also introduces REST endpoints for testing and verifying tokenization configurations without reindexing, which makes it faster to fine-tune search behavior.
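To make the effect of accent folding and per-language stopwords concrete, here is a minimal Python sketch of a text-analysis step placed in front of exact token matching. It is an illustration only and does not use Weaviate's API or reproduce its tokenizer: the names `fold_accents`, `analyze`, and `STOPWORDS`, the tiny stopword lists, and the plain whitespace split are all assumptions made for this example.

```python
import unicodedata

# Hypothetical, deliberately tiny per-language stopword lists (illustration only).
STOPWORDS = {
    "en": {"the", "a", "of"},
    "pt": {"o", "a", "de"},
}


def fold_accents(text: str) -> str:
    """Strip combining marks so that 'café' and 'cafe' yield the same token."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, language: str = "en", fold: bool = True) -> list[str]:
    """Lowercase, optionally fold accents, split on whitespace, drop stopwords."""
    if fold:
        text = fold_accents(text)
    tokens = text.lower().split()
    stop = STOPWORDS.get(language, set())
    return [token for token in tokens if token not in stop]


# With folding, an unaccented query produces the same tokens as accented text.
indexed = analyze("O café de São Paulo", language="pt")
query = analyze("cafe sao paulo", language="pt")
print(indexed == query)
```

Without folding (`fold=False`), the indexed token stays `café` and an exact-match lookup for `cafe` would miss it. A whitespace split like this one also would not work for languages written without spaces between words, which is the case the language-specific tokenizers mentioned above are meant to cover.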

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Vector Search	10	2,268	422	128	+30%
RAG	4	2,105	333	83	+124%
