How to Improve Retrieval Quality for Japanese Text with Sudachi, Milvus/Zilliz, and AWS Bedrock
Blog post from Zilliz
Eisuke Izawa's article examines why retrieval over Japanese text is difficult and presents a hybrid search system to address it: Sudachi for text normalization, Milvus on Zilliz Cloud for vector storage and search, and AWS Bedrock for dense embeddings. The pipeline handles the language's orthographic variation and mixed scripts by combining dense vector search, which captures semantic similarity, with keyword-based BM25 search, which captures exact matches. Reciprocal Rank Fusion (RRF) merges the two ranked result lists into a single ranking. The accompanying tutorial lets readers replicate the setup with Zilliz Cloud's free serverless tier and AWS Bedrock, and it illustrates the approach with scenarios such as internal policy search and e-commerce product retrieval. Because Milvus's built-in functions handle sparse vector generation, the system keeps operational overhead low.
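
To make the fusion step concrete, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It is an illustration of the arithmetic only, not the article's code: Milvus ships its own RRF reranker for hybrid search, and the function name `rrf_fuse`, the document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists: one from dense vector search, one from BM25.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = rrf_fuse([dense_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Only rank positions enter the formula, so the dense similarity scores and the BM25 scores never have to be put on a common scale. In this toy input, `doc_b` and `doc_a` appear in both lists and therefore rise above the documents that only one retriever returned.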