ArmBench-LLM 1.0: Benchmarking LLMs on Armenian Language Tasks

Post Details

Company

Hugging Face

Date Published

April 2, 2026

Authors

Hrant Davtyan, Zaruhi Navasardyan, Spartak Bughdaryan, and bag_min

Word Count

1,205

Company Posts That Month

61

Language

-

Hacker News Points

-

Post removed?

No

Source URL

huggingface.co/blog/Metric-AI/armbench-llm

Summary

ArmBench-LLM 1.0 is a benchmark for evaluating large language models (LLMs) on Armenian language tasks, and it succeeds the initial ArmBench-LLM 0.1 release. Developed by Metric AI Lab, this version uses a larger, carefully constructed dataset covering text classification, multiple-choice QA, grammar correction, and translation, among other tasks. It evaluates major proprietary models alongside popular open-source models such as Qwen and GLM. Google's Gemini 3 Flash leads with an average score of 0.6350 and is more cost-effective than OpenAI's GPT-5.2 Pro, which ranks second. Open-source models are closing the gap: Qwen 3.5-27B outperforms larger models such as GLM-5 and Mistral-Large. The results also show that global model rankings do not always predict proficiency in Armenian, as the Gemini 3 family demonstrates. An accompanying spend report compares the cost-effectiveness of the models, with emphasis on factors such as tokenizer efficiency and reasoning verbosity. ArmBench-LLM 1.0 is open source, so the community can explore its leaderboard, evaluation code, and dataset. The release also notes limitations, including model-specific prompt sensitivity and reliability issues with certain model versions.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	12	5,932	1,046	223	-2%
Vector Search	1	1,739	413	146	-27%
