Lessons from Building an AI Football Commentator
Blog post from Stream
Vision Agents is an open-source framework for building low-latency video AI applications on the edge, using Stream's global edge network and integrating with a range of leading voice and video AI models. In this experiment, the framework was used to build a real-time sports commentator from stock football footage, with Roboflow's RF-DETR identifying players and real-time models from Google Gemini and OpenAI generating the commentary. Neither model delivered the accuracy or speed that live sports demand. Several configurations and enhancements were tried, including SAM3 for more detailed object detection, but both models still failed to track fast action reliably or to maintain context. The experiment highlights the current limitations of real-time models in high-motion scenarios and points to enhancements that could improve their performance.
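
To make the detector-to-commentator handoff concrete, here is a minimal sketch of how per-frame detections might be condensed into a short text hint for a commentary model. It is not the Vision Agents or RF-DETR API: the `Detection` type, the `describe_frame` helper, the class labels "player" and "ball", and the 0.5 confidence cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical, simplified detector output for one object in a frame."""
    label: str         # assumed class names: "player" or "ball"
    confidence: float  # detector score in [0, 1]
    x_center: float    # horizontal box centre, normalised to [0, 1]


def describe_frame(detections: List[Detection], min_confidence: float = 0.5) -> str:
    """Turn raw detections into a compact text hint for a commentary model."""
    # Drop low-confidence detections so the hint is not polluted by noise.
    kept = [d for d in detections if d.confidence >= min_confidence]
    players = [d for d in kept if d.label == "player"]
    balls = [d for d in kept if d.label == "ball"]

    parts = [f"players visible: {len(players)}"]
    if balls:
        # If several ball candidates survive, trust the highest-scoring one.
        ball = max(balls, key=lambda d: d.confidence)
        if ball.x_center < 1 / 3:
            zone = "left third"
        elif ball.x_center < 2 / 3:
            zone = "middle third"
        else:
            zone = "right third"
        parts.append(f"ball in the {zone}")
    else:
        # Fast action often means the ball is missed entirely in a frame.
        parts.append("ball not detected")
    return "; ".join(parts)


if __name__ == "__main__":
    frame = [
        Detection("player", 0.9, 0.2),
        Detection("player", 0.4, 0.5),  # below the cutoff, ignored
        Detection("ball", 0.8, 0.7),
    ]
    print(describe_frame(frame))
```

A hint like this is cheap to compute per frame, but it only describes a single instant. Tracking fast action and keeping context across frames, the two areas where the models struggled, would need state carried between frames, which this sketch deliberately leaves out.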