Fine-tuning TTS models
Blog post from Unsloth
Unsloth now supports fine-tuning Text-to-Speech (TTS) models, so a model can be adapted to a specific dataset and vocal style for applications such as voice cloning or supporting additional languages. The same support covers Speech-to-Text (STT) models such as OpenAI's Whisper, TTS models such as Sesame's CSM, and other models supported by transformers. Training is approximately 1.5 times faster and uses 50% less VRAM than other setups that use Flash Attention 2 (FA2). Unsloth provides free Google Colab notebooks for training, running, and saving these models, and most of the models are uploaded to Hugging Face. The example workflow uses a dataset called 'Elise', whose transcripts include emotion tags so that the fine-tuned model learns to produce expressive audio. Users are encouraged to start with the Orpheus-TTS-3B model because of its compatibility and ease of training, and help is available through Unsloth's community channels on Reddit and Discord.
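To make the idea of emotion tags in transcripts concrete, here is a minimal Python sketch that separates the tags from the spoken text, which can be useful for checking which emotions a dataset actually covers before training. The inline `<tag>` format, the example sentence, and the helper name `split_emotion_tags` are assumptions for illustration; they are not taken from the Elise dataset or from Unsloth's notebooks.

```python
import re

# Assumption: emotion tags appear inline in angle brackets, e.g. <laughs>.
# The real dataset may use a different convention.
TAG_PATTERN = re.compile(r"<([a-z][a-z _-]*)>")


def split_emotion_tags(transcript):
    """Return (plain_text, tags) for a transcript with inline <tag> markers."""
    # Collect tag names in the order they appear.
    tags = TAG_PATTERN.findall(transcript)
    # Remove the tags, then collapse the leftover whitespace.
    plain = TAG_PATTERN.sub(" ", transcript)
    plain = re.sub(r"\s+", " ", plain).strip()
    return plain, tags


# Hypothetical transcript, written for this example only.
example = "I can't believe it <laughs> you actually did it. <sighs>"
plain, tags = split_emotion_tags(example)
print(plain)
print(tags)
```

During fine-tuning the tags would normally stay in the transcript so the model can associate them with the matching audio; stripping them as above is only meant for inspecting the data.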