easytranscriber: Speech Recognition with Accurate Timestamps in the HF Ecosystem
Blog post from Hugging Face
Easytranscriber, developed by KBLab at the National Library of Sweden, is an automatic speech recognition (ASR) library focused on efficient transcription with precise word-level timestamps. It is inspired by the WhisperX library and reports speed improvements of 35% to 102% relative to it, attributed to GPU-accelerated forced alignment, parallel audio file loading, and batched inference for wav2vec2 models. The library supports both ctranslate2 and Hugging Face transformers as backends, bringing WhisperX-style functionality into the Hugging Face ecosystem. Its pipeline has four stages: voice activity detection, transcription, emission extraction, and forced alignment. The stages can be run in sequence or independently of one another. Easytranscriber also includes a search interface, easysearch, for browsing and querying transcription outputs with synchronized audio playback. The library is particularly useful for large-scale projects such as the mass transcription of archival radio recordings, where reducing inefficiencies in data loading and alignment yields substantial time savings.
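To make the forced alignment stage more concrete, the following is a minimal sketch of CTC-style Viterbi alignment in plain Python. It is not easytranscriber's implementation or API: the function names (`ctc_forced_align`, `spans_to_seconds`, `toy_emissions`) are hypothetical, the emissions are a hand-made toy matrix rather than real wav2vec2 output, and the 20 ms frame duration is an assumption that matches the usual wav2vec2 stride at 16 kHz. The idea it illustrates is that, given per-frame log-probabilities and a known transcript, the best path through the frames gives each token a start and end time.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Align a known token sequence to frames with Viterbi under a CTC topology.

    log_probs: list of T frames, each a list of per-class log-probabilities.
    tokens: class indices of the transcript, in order.
    Returns a list of (token, first_frame, last_frame) tuples.
    """
    # Interleave blanks: [blank, t1, blank, t2, ..., tn, blank].
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    n_frames, n_states = len(log_probs), len(states)
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][states[1]]

    for t in range(1, n_frames):
        for s in range(n_states):
            candidates = [(score[t - 1][s], s)]  # stay in the same state
            if s >= 1:
                candidates.append((score[t - 1][s - 1], s - 1))  # advance by one
            # Skip over a blank only between two different tokens.
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append((score[t - 1][s - 2], s - 2))
            best, prev = max(candidates)
            if best > neg_inf:
                score[t][s] = best + log_probs[t][states[s]]
                back[t][s] = prev

    # A path may end in the trailing blank or in the last token.
    s = n_states - 1
    if n_states > 1 and score[-1][n_states - 2] > score[-1][s]:
        s = n_states - 2
    if score[-1][s] == neg_inf:
        raise ValueError("not enough frames to align the token sequence")

    # Backtrack to recover the state visited at every frame.
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()

    # Token states sit at odd indices; record first and last frame of each.
    spans = {}
    for t, s in enumerate(path):
        if s % 2 == 1:
            first, _ = spans.get(s, (t, t))
            spans[s] = (first, t)
    return [(states[s], a, b) for s, (a, b) in sorted(spans.items())]


def spans_to_seconds(spans, frame_seconds=0.02):
    """Convert frame spans to (token, start, end) in seconds (assumed 20 ms frames)."""
    return [(tok, a * frame_seconds, (b + 1) * frame_seconds) for tok, a, b in spans]


def toy_emissions(argmax_per_frame, n_classes, peak=0.9):
    """Build toy log-probabilities with one dominant class per frame."""
    rest = (1.0 - peak) / (n_classes - 1)
    return [
        [math.log(peak if c == k else rest) for c in range(n_classes)]
        for k in argmax_per_frame
    ]


if __name__ == "__main__":
    # Classes: 0 = blank, 1 = "a", 2 = "b"; six frames of toy emissions.
    emissions = toy_emissions([0, 1, 1, 0, 2, 0], n_classes=3)
    print(ctc_forced_align(emissions, tokens=[1, 2]))  # [(1, 1, 2), (2, 4, 4)]
```

In a real pipeline the emissions would come from a CTC acoustic model, the tokens from the transcription stage, and character spans would be merged into word-level timestamps. Running this dynamic program on the GPU and over batches is the kind of optimization the post credits for part of the speedup.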