How to make NeuTTS-air generate over 200 seconds of audio in a single second.

Post Details

Company

Hugging Face

Date Published

Nov. 21, 2025

Author

Yatharth Sharma

Word Count

792

Language

-

Hacker News Points

-

Source URL

huggingface.co/blog/YatharthS/making-neutts-200x-realtime

Summary

NeuTTS-air is a 0.5-billion-parameter text-to-speech (TTS) model that generates realistic, emotional speech and can clone voices, but it runs slowly on GPUs when served through the Transformers library. Yatharth Sharma speeds it up by serving the model with LMDeploy, chosen over alternatives such as vLLM and SGLang for its simpler installation and low latency. Enabling LMDeploy's prefix caching and int8 KV cache improves batching speed and reduces VRAM usage, at the cost of a minor loss in quality. The model's codec, neucodec, is replaced with the faster neucodec-distill, which uses more efficient encoders and significantly speeds up audio generation. Finally, the generated tokens are split into smaller groups and decoded as a batch, which further raises end-to-end throughput. The combined result, per the post's title, is over 200 seconds of audio generated in a single second. Planned future work includes multilingual and multispeaker models and online streaming, to broaden the model's applications.
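
The idea of splitting generated tokens into smaller groups for batch decoding can be illustrated with a minimal sketch. It is not the author's implementation: `split_into_chunks`, `decode_batch_stub`, `decode_in_chunks`, the padding id, and the `samples_per_token` value are all hypothetical names and numbers chosen for illustration, and the stub stands in for the real codec decoder so the example runs without a GPU.

```python
from typing import List, Tuple


def split_into_chunks(
    tokens: List[int], chunk_size: int, pad_id: int = 0
) -> Tuple[List[List[int]], int]:
    """Split a flat token sequence into equal-length chunks.

    The last chunk is right-padded with pad_id so every chunk has the same
    length and the whole set can be decoded as one batch.
    Returns (chunks, number_of_real_tokens_in_last_chunk).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if not chunks:
        return [], 0
    last_valid = len(chunks[-1])
    chunks[-1] = chunks[-1] + [pad_id] * (chunk_size - last_valid)
    return chunks, last_valid


def decode_batch_stub(batch: List[List[int]], samples_per_token: int) -> List[List[float]]:
    """Hypothetical stand-in for a codec decoder (assumption, not a real API).

    Emits samples_per_token fake samples per token, one output row per chunk.
    """
    return [
        [float(tok) for tok in chunk for _ in range(samples_per_token)]
        for chunk in batch
    ]


def decode_in_chunks(tokens: List[int], chunk_size: int, samples_per_token: int) -> List[float]:
    """Decode all chunks in a single batched call, then stitch the audio."""
    chunks, last_valid = split_into_chunks(tokens, chunk_size)
    if not chunks:
        return []
    decoded = decode_batch_stub(chunks, samples_per_token)
    # Drop the samples that came from padding tokens in the last chunk.
    decoded[-1] = decoded[-1][: last_valid * samples_per_token]
    return [sample for chunk in decoded for sample in chunk]


tokens = list(range(1, 11))  # 10 fake speech tokens
audio = decode_in_chunks(tokens, chunk_size=4, samples_per_token=2)
print(len(audio))  # 10 tokens * 2 samples per token
```

With a real decoder, the batched call replaces one long sequential decode with several short ones that run in parallel. Chunk boundaries may introduce audible seams, which this sketch does not address.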