Company
Date Published
Author
Nikolas Laskaris
Word count
3483
Language
English
Hacker News points
2

Summary

Michael Nguyen, a Machine Learning Research Engineer at AssemblyAI, explores the evolution and implementation of end-to-end deep learning models for speech recognition, focusing on building a model in PyTorch. The article traces the shift from traditional multi-stage pipelines to end-to-end models such as Deep Speech and Listen Attend Spell (LAS), which replace hand-engineered components with a single network trained on large datasets. Nguyen provides a detailed walkthrough of a speech recognition model inspired by Baidu's Deep Speech 2, using residual CNN layers to learn audio features and bidirectional RNNs (BiRNNs) for sequence modeling, with torchaudio for data handling and SpecAugment for data augmentation. Training uses the AdamW optimizer with a One Cycle learning rate schedule, and the Connectionist Temporal Classification (CTC) loss lets the network learn alignments between audio frames and transcript characters without frame-level labels. Evaluation relies on Word Error Rate (WER) and Character Error Rate (CER), and Comet.ml is recommended for tracking and optimizing experiments. The post closes with recent advances in speech recognition, including transformers and unsupervised pre-training, which promise further gains in accuracy and efficiency.
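The sketches below illustrate, under stated assumptions, the main building blocks the summary mentions; they are not the article's exact code. First, a minimal torchaudio feature pipeline with SpecAugment-style frequency and time masking, assuming 16 kHz audio; the mask parameters and dummy waveform are illustrative:

```python
import torch
import torch.nn as nn
import torchaudio

# Training-time features: mel spectrogram followed by SpecAugment-style masking.
# Parameter values here are illustrative, not taken from the article.
train_audio_transforms = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128),
    torchaudio.transforms.FrequencyMasking(freq_mask_param=30),
    torchaudio.transforms.TimeMasking(time_mask_param=100),
)

# Validation uses the same features without augmentation.
valid_audio_transforms = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

# Example with a dummy 3-second, 16 kHz waveform (single channel).
waveform = torch.randn(1, 48000)
spec = train_audio_transforms(waveform)  # shape: (channel, n_mels, time)
print(spec.shape)
```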
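A compact sketch of the residual-CNN-plus-BiRNN architecture the summary describes; the layer counts, channel sizes, and use of batch norm (in place of the layer normalization the walkthrough applies over CNN features) are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualCNN(nn.Module):
    """Convolutional block with a skip connection over spectrogram features."""
    def __init__(self, channels, kernel=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv2d(channels, channels, kernel, padding=kernel // 2)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        residual = x
        x = self.dropout(F.gelu(self.bn1(self.conv1(x))))
        x = self.dropout(F.gelu(self.bn2(self.conv2(x))))
        return x + residual

class SpeechRecognitionModel(nn.Module):
    """Residual CNN feature extractor, bidirectional GRU, and a per-time-step
    classifier producing class scores for CTC."""
    def __init__(self, n_mels=128, n_class=29, cnn_channels=32, rnn_dim=256):
        super().__init__()
        self.entry = nn.Conv2d(1, cnn_channels, 3, padding=1)
        self.rescnn = nn.Sequential(*[ResidualCNN(cnn_channels) for _ in range(3)])
        self.birnn = nn.GRU(cnn_channels * n_mels, rnn_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(rnn_dim * 2, n_class)

    def forward(self, x):            # x: (batch, 1, n_mels, time)
        x = self.rescnn(self.entry(x))
        batch, channels, mels, time = x.shape
        x = x.permute(0, 3, 1, 2).reshape(batch, time, channels * mels)
        x, _ = self.birnn(x)
        return self.classifier(x)    # (batch, time, n_class)

model = SpeechRecognitionModel()
dummy = torch.randn(2, 1, 128, 200)  # (batch, channel, n_mels, time)
print(model(dummy).shape)            # torch.Size([2, 200, 29])
```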
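The training setup (CTC loss, AdamW, One Cycle learning rate schedule) might be wired up roughly as follows; the 29-class character set with the blank at index 28, the stand-in GRU model, the hyperparameters, and the dummy batch are placeholder assumptions:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in for the full residual-CNN/BiRNN network; 29 outputs = 28 characters + CTC blank.
model = nn.GRU(input_size=128, hidden_size=29, batch_first=True)
criterion = nn.CTCLoss(blank=28)
optimizer = optim.AdamW(model.parameters(), lr=5e-4)

epochs, steps_per_epoch = 10, 100
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-4, steps_per_epoch=steps_per_epoch, epochs=epochs
)

# One illustrative training step on dummy data.
batch, time_steps, n_mels = 4, 200, 128
spectrograms = torch.randn(batch, time_steps, n_mels)
labels = torch.randint(0, 28, (batch, 20))            # dummy integer-encoded transcripts
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
label_lengths = torch.full((batch,), 20, dtype=torch.long)

outputs, _ = model(spectrograms)                       # (batch, time, classes)
log_probs = nn.functional.log_softmax(outputs, dim=2).transpose(0, 1)  # (time, batch, classes)

loss = criterion(log_probs, labels, input_lengths, label_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()                                       # One Cycle schedule steps per batch
```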
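WER and CER are both edit-distance metrics; a minimal, dependency-free version (not the article's helper functions) could look like this:

```python
def _levenshtein(ref, hyp):
    """Edit distance between two sequences (lists of words or strings of characters)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by the reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return _levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by the reference length."""
    return _levenshtein(reference, hypothesis) / max(len(reference), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~= 0.167
print(cer("speech", "speach"))                              # 1 substitution / 6 chars ~= 0.167
```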
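Finally, experiment tracking with Comet.ml typically amounts to creating an Experiment and logging hyperparameters and metrics; the API key, project name, and metric values below are placeholders:

```python
from comet_ml import Experiment

# Placeholder credentials; a real run would use your own API key and project.
experiment = Experiment(api_key="YOUR_API_KEY", project_name="speech-recognition")

experiment.log_parameters({"learning_rate": 5e-4, "batch_size": 20, "epochs": 10})

# Inside the training/evaluation loop, log the loss and error-rate metrics.
experiment.log_metric("loss", 1.23, step=100)
experiment.log_metric("wer", 0.28, step=100)
experiment.log_metric("cer", 0.09, step=100)
```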