The End of the Orb: Building AI Agents That Feel Present
Blog post from Stream
A new open-source conversational agent addresses two limitations of current voice agents: they often lack visual engagement and emotional awareness. The agent uses Vision Agents for orchestration, Inworld's TTS-2 for expressive speech synthesis, Anam for a lip-synced avatar, MediaPipe for face tracking, Gemini as the language model, and Deepgram for speech-to-text, all running in real time over Stream's edge network. By detecting facial emotion, gaze, and engagement, the agent adapts its responses to the user's emotional state, which makes the interaction feel more personal. Potential applications include interview coaching, education, and customer support, where real-time emotional feedback can improve the interaction. The modular design allows for flexibility and scalability, making the system a versatile base for building emotionally aware agents that engage users more naturally.
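To make the adaptation step concrete, here is a minimal sketch of how face signals might be folded into the language model's prompt. It does not use the Vision Agents, MediaPipe, or Gemini APIs: `FaceSignals`, `style_hint`, `build_prompt`, the emotion labels, and the 0.4 engagement threshold are all hypothetical names and values chosen for illustration, not details taken from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceSignals:
    """Hypothetical per-turn summary from a face tracker (not a real MediaPipe type)."""
    emotion: str          # assumed labels, e.g. "neutral", "happy", "frustrated"
    gaze_on_screen: bool  # True if the user appears to be looking at the avatar
    engagement: float     # assumed scale: 0.0 (disengaged) to 1.0 (fully engaged)


def style_hint(signals: FaceSignals) -> str:
    """Turn face signals into a short instruction for the language model."""
    hints = []
    if signals.emotion in {"frustrated", "sad", "anxious"}:
        hints.append(f"The user seems {signals.emotion}; respond calmly and reassuringly.")
    elif signals.emotion == "happy":
        hints.append("The user seems happy; match their upbeat tone.")
    # The 0.4 threshold is an arbitrary illustrative value.
    if not signals.gaze_on_screen or signals.engagement < 0.4:
        hints.append("The user looks distracted; keep the reply short and check in.")
    return " ".join(hints) or "No strong signal; use a neutral, friendly tone."


def build_prompt(base_prompt: str, signals: FaceSignals) -> str:
    """Append the observed user state to the agent's base instructions."""
    return f"{base_prompt}\n\n[Observed user state] {style_hint(signals)}"


if __name__ == "__main__":
    signals = FaceSignals(emotion="frustrated", gaze_on_screen=False, engagement=0.2)
    print(build_prompt("You are an interview coach.", signals))
```

Keeping the signals in a small, tracker-agnostic structure is one way to preserve the modularity the post describes: the face tracker, language model, or voice could each be replaced without touching the mapping logic.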