
Fine-tuning TTS models

Blog post from Unsloth

Post Details
Company: Unsloth
Author: Daniel & Michael
Word Count: 416
Language: English
Summary

Unsloth now supports fine-tuning Text-to-Speech (TTS) models, adapting them to specific datasets and vocal styles for applications such as voice cloning and multilingual speech. Support extends to Speech-to-Text (STT) models like OpenAI's Whisper and to standard TTS models such as Sesame's CSM, along with other models supported by the transformers library. Training is notably efficient: approximately 1.5 times faster while using 50% less VRAM compared with other setups that use FlashAttention 2 (FA2). Unsloth provides free Google Colab notebooks for training, running, and saving these models, most of which can be uploaded to Hugging Face. The example workflow fine-tunes on a dataset called 'Elise', whose transcripts include inline emotion tags so the model learns to produce expressive audio. Users are encouraged to start with the Orpheus-TTS-3B model for its compatibility and ease of training, and guidance is available through Unsloth's community channels such as Reddit and Discord.
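To make the emotion-tag idea concrete, here is a minimal sketch of how Elise-style transcripts might be prepared for fine-tuning. The tag set, the speaker prefix, and the `format_example` / `extract_tags` helpers are illustrative assumptions for this post, not Unsloth's actual preprocessing API; the key point is only that tags like `<laugh>` stay inline in the training text so the model can associate them with expressive audio.

```python
import re

# Assumed tag vocabulary for illustration; the real Elise dataset
# may use a different set of emotion tags.
EMOTION_TAGS = {"laugh", "sigh", "chuckle", "gasp"}


def format_example(text: str, speaker: str = "elise") -> str:
    """Prefix a transcript with its speaker name (hypothetical format).

    Emotion tags such as <laugh> are deliberately left inline so the
    TTS model sees them alongside the words during fine-tuning.
    """
    return f"{speaker}: {text}"


def extract_tags(text: str) -> list[str]:
    """Return the known emotion tags present in a transcript."""
    return [t for t in re.findall(r"<(\w+)>", text) if t in EMOTION_TAGS]


sample = "I can't believe you did that <laugh> it was so unexpected."
print(format_example(sample))  # speaker-prefixed training text
print(extract_tags(sample))    # ['laugh']
```

Keeping the tags as plain tokens in the transcript, rather than stripping them out, is what lets a fine-tuned model such as Orpheus-TTS-3B render them as audible laughter or sighs at inference time.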