Konkani LLM: Bringing a Multi-Script Low-Resource Language to the AI Era
Blog post from Hugging Face
The Konkani LLM Project aims to bring Konkani, a low-resource Indian language written in several scripts, into the AI ecosystem by tackling two obstacles: data scarcity and script fragmentation. The project built Konkani-Instruct-100k, a large-scale multi-script instruction-tuning dataset, using a synthetic generation pipeline to overcome transliteration errors. The dataset follows a "Tutor-Style" pedagogical framework that covers diverse topics and scripts, and it supports fine-tuning of open-weight models such as Gemma 3 and Llama 3.1 with LoRA, a Parameter-Efficient Fine-Tuning (PEFT) method. The project also introduced Konkani-Bench, a benchmark for evaluating translation and transliteration across scripts, on which the fine-tuned models show significant improvements over the base models. Together, these resources are intended to move Konkani beyond its low-resource status by providing robust AI tools for learning, preserving, and translating the language. The models and datasets are available on Hugging Face.
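To make "script fragmentation" concrete, here is a minimal Python sketch of how records in a multi-script corpus could be tagged by writing system before training or evaluation. It is not the project's pipeline. The function name `detect_script`, the `SCRIPT_PREFIXES` table, and the list of scripts (Devanagari, Romi/Latin, Kannada, Malayalam, Perso-Arabic) are assumptions made for illustration, based on the scripts commonly used for Konkani rather than on anything stated in this summary.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode character-name prefixes to script labels.
# The set of scripts is an assumption, not taken from the project itself.
SCRIPT_PREFIXES = {
    "DEVANAGARI": "Devanagari",
    "KANNADA": "Kannada",
    "MALAYALAM": "Malayalam",
    "ARABIC": "Perso-Arabic",
    "LATIN": "Romi (Latin)",
}


def detect_script(text: str) -> str:
    """Return the label of the script that most letters in `text` belong to."""
    counts = Counter()
    for ch in text:
        # Count letters and combining marks; skip digits, spaces, punctuation.
        if not (ch.isalpha() or unicodedata.category(ch).startswith("M")):
            continue
        name = unicodedata.name(ch, "")
        for prefix, label in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[label] += 1
                break
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


# The same word, "Konkani", written in three scripts.
samples = ["कोंकणी", "Konknni", "ಕೊಂಕಣಿ"]
by_script = {s: detect_script(s) for s in samples}
```

A tagger like this only identifies which script a sample uses. It does not check whether a transliteration is correct, which is the harder problem a benchmark such as Konkani-Bench is meant to measure.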