Company:
Date Published:
Author: Yatharth Sharma
Word count: 792
Language: -
Hacker News points: None

Summary

NeuTTS-air is a 0.5-billion-parameter text-to-speech (TTS) model that generates realistic, emotional speech and can clone voices, but it runs slowly on GPUs when served through the transformers library. Yatharth Sharma made it generate audio much faster by serving it with LMDeploy, which he chose over alternatives such as vLLM and SGLang for its simpler installation and low latency. Enabling LMDeploy's prefix caching and int8 KV cache sped up batched generation and reduced VRAM usage, at the cost of a minor loss in quality. The model's codec, neucodec, was also replaced with the faster neucodec-distill, which uses more efficient encoders and significantly speeds up audio generation. A further optimization split the generated tokens into smaller groups and decoded those groups together as one batch, which raised end-to-end speed again. Planned future work includes multilingual and multispeaker models and online streaming, to broaden the model's applications.
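
The chunked batch decoding step can be illustrated with a minimal sketch. This is not the author's code: the function names, the `chunk_size` and `samples_per_token` parameters, the padding token, and the toy decoder are all hypothetical stand-ins, and a real codec such as neucodec-distill would take tensors rather than Python lists. The sketch only shows the idea of splitting one long token sequence into equal-width groups, decoding all groups in a single batched call, and stitching the audio back together in order.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive groups of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def pad_chunks(chunks: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every chunk to the same width so they form a rectangular batch."""
    width = max((len(c) for c in chunks), default=0)
    return [c + [pad_id] * (width - len(c)) for c in chunks]


def decode_in_batches(
    tokens: Sequence[int],
    decode_batch: Callable[[List[List[int]]], List[List[float]]],
    chunk_size: int,
    samples_per_token: int,
    pad_id: int = 0,
) -> List[float]:
    """Decode all chunks with one batched call, then concatenate in order.

    decode_batch is a hypothetical codec wrapper: it takes a batch of
    equal-length token rows and returns one list of samples per row.
    """
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        return []
    decoded = decode_batch(pad_chunks(chunks, pad_id))  # single call for every chunk
    audio: List[float] = []
    for chunk, samples in zip(chunks, decoded):
        # Drop the samples that were produced by padding tokens.
        audio.extend(samples[: len(chunk) * samples_per_token])
    return audio


def toy_decode_batch(batch: List[List[int]]) -> List[List[float]]:
    """Toy stand-in for a codec decoder: each token becomes two identical samples."""
    return [[float(t) for t in row for _ in range(2)] for row in batch]


if __name__ == "__main__":
    audio = decode_in_batches([1, 2, 3, 4, 5], toy_decode_batch,
                              chunk_size=2, samples_per_token=2)
    print(len(audio), audio)
```

In this toy setup the chunks decode independently, so the stitched output matches decoding the whole sequence at once. A real neural codec usually sees context across neighbouring tokens, so chunk boundaries may need overlap or cross-fading; whether the original work does this is not stated in the summary.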