Nano-BEIR: A Multilingual Information Retrieval Benchmark with Quality-Enhanced Queries
Blog post from HuggingFace
Nano-BEIR is a multilingual information retrieval benchmark introduced to address the limitations of existing datasets. It covers five languages (English, Korean, Japanese, Thai, and Vietnamese) with 649 queries across 13 diverse retrieval tasks. The benchmark improves query quality with a two-phase preprocessing pipeline that converts informal statements into proper retrieval queries, and high-quality translation particularly strengthens support for underrepresented languages such as Thai and Vietnamese. An evaluation of eight embedding models on the benchmark reveals language-specific performance differences and a persistent English-centric bias in training data. Because the datasets are publicly available, Nano-BEIR facilitates reproducible research and supports advances in multilingual IR systems.
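
To make concrete what evaluating an embedding model on a retrieval task involves, here is a minimal sketch in plain Python. It ranks documents by cosine similarity to a query embedding and scores the ranking with nDCG@10. The summary above does not say which metric Nano-BEIR reports; nDCG@10 is assumed here because it is the metric commonly used for BEIR-style benchmarks. The toy vectors, document ids, and function names are hypothetical and stand in for real model outputs and benchmark data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc id -> graded relevance."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (assumption): 2-d vectors standing in for real embeddings.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
query_vec = [1.0, 0.1]
relevance = {"d1": 1}  # only d1 is relevant to this query

ranked = rank_documents(query_vec, doc_vecs)
score = ndcg_at_k(ranked, relevance, k=10)
print(ranked, score)
```

A benchmark score would typically be the mean of this per-query value over all queries in a task, computed separately per language so that gaps such as an English-centric bias become visible.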