ArmBench-LLM 1.0: Benchmarking LLMs on Armenian Language Tasks
Blog post from HuggingFace
ArmBench-LLM 1.0 is a benchmark for evaluating large language models (LLMs) on Armenian language tasks and the successor to the initial release, ArmBench-LLM 0.1. Developed by Metric AI Lab, this iteration uses a larger, carefully crafted dataset that covers text classification, multiple-choice QA, grammar correction, translation, and other tasks. It evaluates major proprietary models alongside popular open-source models such as Qwen and GLM.

Google's Gemini 3 Flash leads with an average score of 0.6350 and is a more cost-effective option than OpenAI's GPT-5.2 Pro, which ranks second. Open-source models are closing the gap: Qwen 3.5-27B outperforms larger models such as GLM-5 and Mistral-Large. The results also show that global model rankings do not always correlate with proficiency in Armenian, as the Gemini 3 family demonstrates.

The initiative also provides a spend report on the cost-effectiveness of the evaluated models, emphasizing factors such as tokenizer efficiency and reasoning verbosity. ArmBench-LLM 1.0 is open source, so the community can explore its leaderboard, evaluation code, and dataset. The post also notes limitations, including model-specific prompt sensitivity and reliability issues with certain versions.
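To make the cost-effectiveness idea concrete, here is a minimal Python sketch of how an average score and a per-request cost could be combined. It is not the benchmark's actual methodology. The unweighted mean, the function names, the task keys, the prices, and every number below are assumptions chosen only for illustration.

```python
from statistics import mean


def average_score(task_scores):
    """Unweighted mean over per-task scores (an assumed aggregation)."""
    return mean(task_scores.values())


def run_cost_usd(prompt_chars, tokens_per_char, output_tokens,
                 usd_per_m_input, usd_per_m_output):
    """Estimate the cost of one request in US dollars.

    tokens_per_char models tokenizer efficiency: a tokenizer that splits
    Armenian script into more tokens makes the same prompt more expensive.
    output_tokens models reasoning verbosity: longer reasoning traces are
    billed as output tokens.
    """
    input_tokens = prompt_chars * tokens_per_char
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Hypothetical values for illustration only; none come from the benchmark.
scores = {"classification": 0.70, "multiple_choice_qa": 0.60, "translation": 0.50}
avg = average_score(scores)
cost = run_cost_usd(prompt_chars=2_000, tokens_per_char=0.5, output_tokens=1_000,
                    usd_per_m_input=1.0, usd_per_m_output=4.0)
score_per_dollar = avg / cost
print(f"average={avg:.4f} cost=${cost:.4f} score_per_dollar={score_per_dollar:.1f}")
```

Under these assumptions, two models with similar average scores can differ widely in score per dollar if one tokenizes Armenian text less efficiently or produces much longer reasoning output.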