Open Vision Agents by Stream: Open Source SDK for Building Low-Latency Vision AI Apps
Blog post from Stream
Vision Agents is an open-source framework developed by Stream for building low-latency vision AI applications, with a focus on real-time voice and video models such as OpenAI Realtime and Gemini Live. Integration centers on a generic Agent class that handles the complexity of tracks, video subscriptions, and response type conversions. The framework supports text-to-speech, speech-to-text, and speech-to-speech models, so developers can plug in their preferred large language models (LLMs). Vision Agents is built video-first: it prioritizes real-time video processing over WebRTC and provides customizable video processors for tasks such as pose detection and anomaly detection in manufacturing. Target applications include sports coaching, meeting assistance, and accessibility features, as well as integration with robotics and IoT. By processing visual and audio data together, the framework allows for more natural interactions, and it offers built-in memory and context retention across sessions. The project encourages community involvement and collaboration with AI companies to expand its support for AI models and services.
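To make the idea of a customizable video processor concrete, here is a minimal sketch of the general pattern: an agent passes each frame through a chain of processors, and each processor attaches annotations that a model could later use. This is not the Vision Agents API. Every name here (Frame, VideoProcessor, BrightnessAnomalyDetector, ToyAgent, handle_frame) is hypothetical, and the brightness check is only a toy stand-in for real anomaly detection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Frame:
    """Hypothetical stand-in for one decoded video frame (grayscale, 0-255)."""
    index: int
    pixels: List[List[int]]
    annotations: Dict[str, object] = field(default_factory=dict)


class VideoProcessor(Protocol):
    """Hypothetical interface: take a frame, return it with annotations added."""
    def process(self, frame: Frame) -> Frame: ...


class BrightnessAnomalyDetector:
    """Toy processor: flags frames whose mean brightness drifts from a baseline."""

    def __init__(self, threshold: float = 50.0) -> None:
        self.threshold = threshold
        self.baseline: Optional[float] = None

    def process(self, frame: Frame) -> Frame:
        values = [p for row in frame.pixels for p in row]
        mean = sum(values) / len(values)
        if self.baseline is None:
            # The first frame seen defines the initial baseline.
            self.baseline = mean
        is_anomaly = abs(mean - self.baseline) > self.threshold
        frame.annotations["anomaly"] = is_anomaly
        if not is_anomaly:
            # Only normal frames update the baseline (exponential moving average).
            self.baseline = 0.9 * self.baseline + 0.1 * mean
        return frame


class ToyAgent:
    """Hypothetical agent that runs every frame through its processors in order."""

    def __init__(self, processors: List[VideoProcessor]) -> None:
        self.processors = processors

    def handle_frame(self, frame: Frame) -> Frame:
        for processor in self.processors:
            frame = processor.process(frame)
        return frame


if __name__ == "__main__":
    agent = ToyAgent([BrightnessAnomalyDetector(threshold=50.0)])
    for i, level in enumerate([100, 102, 10]):
        result = agent.handle_frame(Frame(i, [[level, level], [level, level]]))
        print(result.index, result.annotations)
```

In a real deployment the frames would arrive over a WebRTC video track and the annotations would be forwarded to the model alongside the audio; the sketch leaves both out to keep the processor pattern itself visible.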