Author
Yatharth Sharma
Word count
871

Summary

Text-to-Speech (TTS) models are becoming increasingly popular, and recent advancements have simplified their architectures into a two-part system: a neural codec and a Large Language Model (LLM). The neural codec compresses audio into discrete tokens, which the LLM treats as a new "language," generating speech from text much as it generates text. This design supports high-quality TTS as well as related tasks like Automatic Speech Recognition (ASR), and it offers scalability, multimodality, and simplification by eliminating the need for phoneme-based pipelines. Neural codecs differ in characteristics such as tokens per second, codebook size, and sampling rate, which in turn influence the speed and quality of audio processing. This use of LLMs for audio streamlines the TTS pipeline and paves the way for further advances in audio-based AI applications.
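The codec half of the pipeline can be illustrated with a toy sketch. Real neural codecs (EnCodec-style models, for instance) use learned vector quantization; the version below simply quantizes per-frame signal levels against a fixed codebook, purely to show the waveform-to-discrete-token mapping that the LLM would then consume. All names and numbers here (frame size, codebook size) are illustrative assumptions, not any specific codec's parameters.

```python
import numpy as np

# Illustrative codec parameters (assumptions, not a real codec's values):
CODEBOOK_SIZE = 1024   # number of distinct audio tokens
FRAME_SIZE = 320       # samples per token; at 24 kHz this is 75 tokens/s

# Toy "codebook": evenly spaced amplitude levels. A real codec learns
# codebook vectors over latent embeddings, not raw amplitudes.
codebook = np.linspace(-1.0, 1.0, CODEBOOK_SIZE)

def encode(waveform: np.ndarray) -> np.ndarray:
    """Map a waveform to a sequence of discrete token ids."""
    n = len(waveform) // FRAME_SIZE
    frames = waveform[: n * FRAME_SIZE].reshape(n, FRAME_SIZE)
    means = frames.mean(axis=1)
    # Nearest-neighbour quantization of each frame's mean level.
    return np.abs(means[:, None] - codebook[None, :]).argmin(axis=1)

def decode(tokens: np.ndarray) -> np.ndarray:
    """Reconstruct an (approximate) waveform from token ids."""
    return np.repeat(codebook[tokens], FRAME_SIZE)

# One second of a slow 2 Hz sine at 24 kHz, so frame means vary smoothly.
sr = 24000
t = np.arange(sr) / sr
audio = 0.5 * np.sin(2 * np.pi * 2 * t)

tokens = encode(audio)   # 75 integer tokens for 1 s of audio
recon = decode(tokens)   # coarse reconstruction from tokens alone
```

The token sequence is what the LLM sees: generating speech becomes next-token prediction over audio token ids, exactly as with text. The codec's frame rate and codebook size trade off sequence length (LLM cost) against reconstruction fidelity.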