Create Speech-to-Text Experiences with ElevenLabs Scribe v2 Realtime & Vision Agents
Blog post from Stream
ElevenLabs has released Scribe v2 Realtime, a speech-to-text (STT) model with a latency of approximately 150 milliseconds, support for over 90 languages, and the lowest Word Error Rate in several benchmarks. The model is aimed at applications such as live meetings, note-taking, and conversational AI, where real-time accuracy is crucial. It can transcribe both user speech and agent responses in real time, so conversations proceed without noticeable lag. The setup described in the post uses ElevenLabs for both STT and text-to-speech (TTS), the Gemini LLM, and the Vision Agents framework, and it requires API keys from ElevenLabs, Google AI Studio, and Stream. The open-source Vision Agents framework simplifies integrating Scribe v2 Realtime into voice AI applications that need accurate live captioning and speech understanding.
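Because the setup depends on keys from three separate providers, it can help to check for them before starting an agent. The following is a minimal sketch of such a check, not part of the Vision Agents framework. The environment variable names and the `missing_keys` helper are assumptions made for illustration; consult each provider's documentation for the names its SDK actually reads.

```python
import os
from typing import Mapping

# Hypothetical variable names, chosen for illustration only.
# Each maps to the provider the post says a key is required from.
REQUIRED_KEYS = {
    "ELEVENLABS_API_KEY": "ElevenLabs (STT and TTS)",
    "GOOGLE_API_KEY": "Google AI Studio (Gemini LLM)",
    "STREAM_API_KEY": "Stream",
}


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required keys that are unset or blank in env."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_keys()` with no arguments inspects the current process environment and returns an empty list when all three keys are present. Passing a plain dictionary instead makes the check easy to exercise in tests without touching real credentials.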