ATE-2: State-of-the-Art Armenian Text Embeddings and the ArmBench-TextEmbed Benchmark
Blog post from Hugging Face
The ATE-2 (Armenian Text Embeddings 2) models challenge the assumption that effective text embeddings for low-resource languages (LRLs) require massive or high-quality datasets: they achieve significant improvements using just 10,000 noisy synthetic data pairs. Released alongside the ArmBench-TextEmbed benchmark, the models show that fine-tuning a multilingual encoder on small-scale data can yield substantial performance gains, rivaling models trained on much larger datasets. ATE-2 also handles both native-script and transliterated Armenian queries, outperforming other leading models on semantic alignment tasks. The approach makes high-performance embeddings more accessible for LRLs and gives other resource-constrained communities a framework for developing their own text embedding solutions.
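
To make the idea of fine-tuning an encoder on query/passage pairs more concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss, a common objective for this kind of training. This summary does not state which objective ATE-2 actually uses, so the choice of loss, the temperature value, and the names `cosine` and `in_batch_contrastive_loss` are all assumptions made for illustration. The toy vectors stand in for encoder outputs and are not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, passage) pairs.

    Passage i is treated as the positive for query i; every other
    passage in the batch serves as a negative. The temperature is an
    assumed hyperparameter, not a value taken from the post.
    """
    if len(query_vecs) != len(passage_vecs):
        raise ValueError("queries and passages must be paired one-to-one")
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true passage.
        total += log_denom - logits[i]
    return total / len(query_vecs)


# Toy stand-ins for encoder outputs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each passage matches its own query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each passage matches the other query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each query is most similar to its own passage and grows when the pairs are mismatched. Minimizing it pulls paired texts together in embedding space, and because the negatives come from the same batch, even a small set of noisy pairs provides many training comparisons.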