Breaking Brahmic: How OpenAI's Text Cleaning Hides Whisper's True Word Error Rate for Many South Asian Languages

Post Details

Company

Deepgram

Date Published

Jan. 6, 2023

Author

Ross O'Connell

Word Count

1,518

Company Posts That Month

14

Language

English

Hacker News Points

-

Source URL

deepgram.com/learn/how-openai-s-text-normalization-hides-whisper-s-true-word-error-rate-for-south-asian-and-southeast-asian-languages

Summary

OpenAI reports impressive word error rates for its Whisper model on many South Asian languages, including Tamil, Hindi, and Bengali. On closer examination, these reported error rates appear artificially low because of a bug in OpenAI's text cleaning process, which makes Whisper look more accurate than it is. The issue affects not only Tamil but also other languages that use related (Brahmic) writing systems, which together have over one billion speakers worldwide. When compared on a more even playing field, Deepgram's Tamil language model outperforms Whisper on Tamil in accuracy, speed, and cost of use.
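
The summary does not spell out how the cleaning bug works. As a rough illustration only, the sketch below assumes that the cleaning step removes Unicode combining marks, the character class that carries vowel signs in Brahmic scripts. The helpers `strip_marks` and `wer` are hypothetical names written for this example; they are not Whisper's or Deepgram's code. The sketch scores a one-word Tamil reference against a wrong hypothesis, with and without that cleaning.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical cleaner (an assumption, not Whisper's actual code):
    decompose the text, then drop every character in a Unicode mark category."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Two different Tamil words that differ only by a vowel sign:
REFERENCE = "\u0b95\u0bbe\u0bb2\u0bcd"  # "kaal" (leg)
HYPOTHESIS = "\u0b95\u0bb2\u0bcd"       # "kal" (stone)

raw_wer = wer(REFERENCE, HYPOTHESIS)
cleaned_wer = wer(strip_marks(REFERENCE), strip_marks(HYPOTHESIS))
print(f"WER on raw text:     {raw_wer}")
print(f"WER on cleaned text: {cleaned_wer}")
```

Under this assumption, the two words collapse to the same string once marks are removed, so a genuine transcription error is scored as a match and the measured error rate drops.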

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	1	292	59	28	+7%