
Prompting voice agents to sound more realistic

Blog post from LiveKit

Post Details
Company: LiveKit
Author: Shayne Parmelee
Word Count: 1,326
Language: English
Summary

Developers building voice AI agents must choose between a speech-to-speech (S2S) model and a cascaded pipeline (STT-LLM-TTS). Cascaded pipelines can be fast and reliable, but they tend to produce speech that sounds like written language read aloud, missing the hallmarks of human conversation: filler words, pauses, and mid-sentence corrections. To make an agent sound more realistic, the post recommends giving the model explicit instructions paired with concrete examples that demonstrate natural speech patterns, including where filler words and pauses should fall. Emotion tags should be treated as constraints so the agent's emotional register stays consistent, and personality should be defined as observable speech patterns rather than adjectives, which makes interactions feel more authentic. The key is a detailed system prompt with deliberate redundancy, so the model internalizes these nuances and robotic-sounding output is reduced.