TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Blog post from HuggingFace

Post Details
Authors: Özay Ezerceli, Mahmud ElHuseyni 🇵🇸, SELVA TAŞ, Reyhan Bayraktar, Betül Terzioğlu, Yusuf Çelebi, Yağız Asker, and nmmursit
Word Count: 3,173
Summary

TurkColBERT is presented as the first comprehensive benchmark comparing dense bi-encoders with late-interaction models for Turkish information retrieval (IR). The study adapts multilingual and English encoders to Turkish through semantic fine-tuning, then converts them into ColBERT-style retrievers using PyLate and MS MARCO-TR. Across five Turkish BEIR datasets, late-interaction models consistently outperform dense baselines, and ultra-compact BERT-Hash variants perform strongly despite their very small parameter counts. MUVERA indexing makes retrieval 3.3 times faster than PLAID while maintaining or slightly improving retrieval precision. The evaluation highlights the advantages of late-interaction architectures, particularly token-level matching, which benefits the morphologically rich Turkish language. The study also examines the trade-off between model size and performance and shows that compact models can remain competitive, which supports deploying IR systems on resource-constrained devices. Future work aims to expand Turkish IR benchmarks and explore hybrid retrieval architectures, among other goals.
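
To make the contrast between dense and late-interaction scoring concrete, here is a minimal sketch in plain Python. It assumes made-up 2-dimensional embeddings and hypothetical function names (`dense_score`, `maxsim_score`). It does not reproduce PyLate's API or any model from the benchmark. It only illustrates the general idea that a dense bi-encoder compares one vector per text, while a ColBERT-style scorer sums, over query tokens, the best match among the document's token vectors.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Dense bi-encoder scoring: one vector per text, one similarity per pair."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction (ColBERT-style) scoring: for each query token, take the
    similarity of its best-matching document token, then sum over the query."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy token embeddings, invented purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.1]]              # matches only the first query token well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because each query token is matched independently, a document that covers every query token (`doc_a`) scores higher than one that covers only some of them (`doc_b`). This is the token-level behavior the summary associates with morphologically rich languages such as Turkish.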