
Prompting voice agents to sound more realistic

Blog post from LiveKit

Post Details
Company: LiveKit
Author: Shayne Parmelee
Word Count: 1,326
Language: English
Summary

Developers building voice AI agents must choose between a speech-to-speech (S2S) model and a cascaded pipeline (STT-LLM-TTS). Cascaded pipelines can be fast and reliable, but they tend to produce speech that sounds like written language read aloud, missing the hallmarks of human conversation: filler words, pauses, and mid-sentence corrections. To make an agent sound more realistic, the post recommends giving the model explicit instructions paired with concrete examples that demonstrate natural speech patterns, including where filler words and pauses should fall. Emotion tags should be treated as constraints so the agent's emotional register stays consistent, and personality should be defined as observable speech patterns rather than adjectives, which makes interactions feel more authentic. The key is a detailed system prompt with deliberate redundancy, so the model internalizes these nuances and robotic-sounding output is reduced.