When Speech AI Meets the Long Tail of Languages: Inside the VAANI Dataset

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Sujith Pulikodan, Sanka, Nihar Desai, Suryansh Shukla, and Prasanta Kumar Ghosh
Word Count: 901
Language: -
Hacker News Points: -
Summary

The VAANI dataset, developed by ARTPARK at the Indian Institute of Science, targets a known weakness of current automatic speech recognition (ASR) models: poor coverage of linguistic and geographic diversity, particularly in India. Traditional datasets tend to overrepresent urban areas and standardized language forms. VAANI instead collects data district by district, capturing speech across 165 districts and covering 109 languages, 59 of which are absent from existing datasets. This approach brings in regional accents, dialectal shifts, and socio-linguistic variation, making VAANI a significant resource for multilingual and low-resource speech research. The dataset contains more than 31,255 hours of audio from 156,534 speakers, along with nearly 300,000 images to support multimodal learning. Beyond filling gaps in language representation and speaker diversity, VAANI shows how strongly geography shapes language variation, and it documents underrepresented and previously uncaptured languages that traditional linguistic inventories miss.