Fine-tuning TTS models

Post Details

Company

Unsloth

Date Published

May 2, 2025

Author

Daniel & Michael

Word Count

416

Language

English

Hacker News Points

-

Source URL

unsloth.ai/blog/tts

Summary

Unsloth now supports fine-tuning Text-to-Speech (TTS) models, so they can be adapted to specific datasets and vocal styles for applications such as voice cloning and multilingual speech. The same support covers Speech-to-Text (STT) models such as OpenAI's Whisper, along with TTS models such as Sesame's CSM and others supported by transformers. Training is approximately 1.5 times faster and uses 50% less VRAM than other setups that use FlashAttention 2 (FA2). Unsloth provides free Google Colab notebooks for training, running, and saving these models, and most of the models are also uploaded to Hugging Face. The example workflow uses a dataset called 'Elise', whose transcripts include emotion tags so that the fine-tuned model can produce expressive audio. Users are encouraged to start with the Orpheus-TTS-3B model for its compatibility and ease of training, and help is available through Unsloth's community channels on Reddit and Discord.
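
To make the idea of emotion tags in transcripts concrete, here is a minimal Python sketch that separates tags from the spoken text. It assumes tags are written inline in angle brackets (for example `<laughs>`); the summary does not specify the tag syntax, and the `split_transcript` helper and the sample sentence are hypothetical, not part of Unsloth or the Elise dataset.

```python
import re

# Assumption: emotion tags appear inline as lowercase words in angle
# brackets, e.g. "<laughs>". The real dataset's syntax may differ.
TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def split_transcript(text: str) -> tuple[str, list[str]]:
    """Return the transcript with tags removed, plus the tags in order."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    plain = " ".join(plain.split())  # collapse whitespace left by removed tags
    return plain, tags


# Hypothetical sample transcript, not taken from the dataset.
example = "I can't believe it <laughs> you actually did it. <sigh>"
plain_text, emotion_tags = split_transcript(example)
print(plain_text)
print(emotion_tags)
```

In fine-tuning, the tags would normally stay in the transcript so the model learns to associate them with the matching sounds; stripping them as above is mainly useful for inspecting which tags a dataset contains.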