Cost-efficient, high-performance TTS with Qwen3-TTS
Blog post from Baseten
Voice is becoming a common way to interact with large language model (LLM) systems. The Qwen3-TTS model family, optimized with vLLM-Omni, provides cost-effective, high-performance text-to-speech (TTS) for applications such as voice agents, language learning, and enterprise call infrastructure. The models cost approximately $3-$4 per million characters, notably less than many closed-source alternatives. The architecture splits generating acoustic tokens and decoding them into separate processing stages, which increases concurrency and reduces cost. The system handles multiple requests simultaneously, improving throughput while maintaining high-quality voice output. Further enhancements, including dynamic frame accumulation, speaker embedding caching, and word timestamps, make Qwen3-TTS suitable for real-time applications and voice cloning. Baseten's implementation offers these capabilities at a fraction of the cost of other providers, with ongoing updates to support the open-source community and facilitate custom voice fine-tuning.
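
To make the pricing concrete, here is a minimal sketch of the arithmetic behind the quoted range of approximately $3-$4 per million characters. It assumes simple linear per-character billing, which the post does not confirm, and the function names are hypothetical rather than part of any Baseten or Qwen API.

```python
def estimate_tts_cost(num_chars: int, usd_per_million_chars: float) -> float:
    """Estimate the USD cost of synthesizing num_chars characters.

    Assumes cost scales linearly with character count (an assumption,
    not a documented billing rule).
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * usd_per_million_chars


def estimate_cost_range(num_chars: int,
                        low_rate: float = 3.0,
                        high_rate: float = 4.0) -> tuple[float, float]:
    """Return (low, high) cost estimates using the approximate
    $3-$4 per million characters range quoted in the post."""
    return (estimate_tts_cost(num_chars, low_rate),
            estimate_tts_cost(num_chars, high_rate))


if __name__ == "__main__":
    low, high = estimate_cost_range(2_500_000)
    print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, a workload of 2.5 million characters would come to roughly $7.50 to $10.00.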