
How to evaluate Speech Recognition models

What's this blog post about?

In this article, we delve into the evaluation and comparison of Speech Recognition models. Proper scientific evaluation is vital to understanding how these models perform in real-world applications. We discuss several aspects of a sound evaluation process, including using consistent datasets, normalizers, and metrics for comparisons.

First, we must ensure the dataset used for evaluation is consistent across the Speech Recognition models being compared. This means using the same public datasets, or incorporating noise into them in the same way to simulate real-world conditions. Doing so eliminates biases that would otherwise arise from differences in the test data itself.

Second, it is crucial to use a consistent normalizer when evaluating different Speech Recognition models. A normalizer standardizes transcriptions (casing, punctuation, number formats, and so on) so that comparisons between models are fair. Using the same open-source normalizer, such as Whisper's, is therefore essential when comparing multiple models.

Third, the choice of evaluation metric plays a significant role in determining how well a score reflects real-world performance. Word Error Rate (WER) provides a good measure of overall performance, but it only counts errors and does not capture their magnitude: transcribing "hello" as "yellow" is penalized exactly as much as transcribing it as an entirely unrelated word. In some cases, such as when proper nouns must be transcribed accurately, even a WER computed only over proper nouns may not be sufficient. Jaro-Winkler distance can serve as an alternative metric that offers a more fine-grained notion of similarity between two strings, aligning better with human judgments of transcription quality.

In summary, proper scientific evaluation and comparison of Speech Recognition models requires consistent datasets, normalizers, and metrics across all evaluations. Following these guidelines lets us assess model performance in real-world applications more accurately and make informed decisions when selecting a model for a specific task or use case.
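As a rough sketch of how these pieces fit together, the snippet below normalizes a reference transcript and a model hypothesis with Whisper's open-source English normalizer, then scores them with both WER and Jaro-Winkler similarity. The package choices (openai-whisper for the normalizer, jiwer for WER, jellyfish for Jaro-Winkler) and the example transcripts are illustrative assumptions, not taken from the original post.

```python
# Minimal evaluation sketch, assuming three pip-installable packages:
# openai-whisper (for its text normalizer), jiwer (for WER),
# and jellyfish (for Jaro-Winkler similarity).
import jiwer
import jellyfish
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()

# Hypothetical reference transcript and model output.
reference = "Dr. Smith lives on 5th Avenue."
hypothesis = "doctor smith lives on fifth avenue"

# Apply the SAME normalizer to both strings so that formatting
# differences (casing, punctuation, number styles) are not counted
# as transcription errors.
ref_norm = normalize(reference)
hyp_norm = normalize(hypothesis)

# Word Error Rate: counts word substitutions, insertions, and
# deletions, so a near-miss costs as much as a completely wrong word.
wer = jiwer.wer(ref_norm, hyp_norm)

# Jaro-Winkler similarity: a character-level score in [0, 1] that
# rewards near-misses, giving a more fine-grained notion of closeness.
jw = jellyfish.jaro_winkler_similarity(ref_norm, hyp_norm)

print(f"WER: {wer:.3f}  Jaro-Winkler similarity: {jw:.3f}")
```

Because both metrics are computed on the same normalized strings, any difference between models reflects transcription quality rather than formatting conventions, which is the point of fixing the dataset, normalizer, and metric across all evaluations.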

Company
AssemblyAI

Date published
June 15, 2023

Author(s)
Ryan O'Connor

Word count
3816

Hacker News points
5

Language
English


By Matt Makai. 2021-2024.