QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
Blog post from Hugging Face
QIMMA is an Arabic LLM leaderboard that addresses the challenges of evaluating Arabic NLP by running a rigorous quality validation pipeline before any model is evaluated. It documents systematic quality issues in existing benchmarks and provides a unified evaluation suite of more than 52,000 samples across seven domains, with 99% native Arabic content. QIMMA uniquely integrates code evaluation, applies multi-stage validation that combines automated assessment with human review to preserve cultural and dialectal accuracy, and releases per-sample inference outputs for transparency. The leaderboard results show that model performance does not depend on size alone: smaller, Arabic-specialized models often outperform larger multilingual ones in specific domains.
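To make the idea of multi-stage validation more concrete, here is a minimal sketch of how automated assessment could feed into human review. It is an illustration under stated assumptions, not QIMMA's actual pipeline: the `Sample` fields, the `triage` function, the score range, and both thresholds are hypothetical and do not come from the post.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    DROP = "drop"


@dataclass
class Sample:
    # All fields are hypothetical; the post does not specify a schema.
    sample_id: str
    question: str
    answer: str
    # Quality scores in [0, 1] from automated assessors (assumed scale).
    auto_scores: list = field(default_factory=list)


def triage(sample, keep_threshold=0.8, drop_threshold=0.3):
    """Route one sample using the mean of its automated scores.

    The thresholds are placeholders chosen for illustration only.
    """
    if not sample.auto_scores:
        # No automated signal at all: leave the decision to a human.
        return Verdict.HUMAN_REVIEW
    mean_score = sum(sample.auto_scores) / len(sample.auto_scores)
    if mean_score >= keep_threshold:
        return Verdict.KEEP
    if mean_score < drop_threshold:
        return Verdict.DROP
    # Ambiguous cases are escalated rather than decided automatically.
    return Verdict.HUMAN_REVIEW


def validate(samples):
    """Group sample IDs by verdict so the review queue is explicit."""
    routed = {verdict: [] for verdict in Verdict}
    for sample in samples:
        routed[triage(sample)].append(sample.sample_id)
    return routed
```

The design choice worth noting is the middle band: samples that the automated stage cannot confidently keep or drop go to reviewers, which is where cultural and dialectal judgment would be applied.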