Retrieval-Augmented Generation (RAG) is expanding beyond text to include audio and video, as demonstrated by Ragie's enterprise-grade RAG-as-a-service platform, which now supports both media types. Doing so meant addressing the challenges specific to multimodal data: preprocessing, data extraction, chunking, and indexing. Ragie's pipeline converts media files into searchable, semantically meaningful chunks enriched with metadata such as timestamps and source links. For audio, the platform uses faster-whisper for transcription; for video, it uses a vision LLM, specifically Google's Gemini models, to generate detailed scene descriptions. Working from these text representations rather than native multimodal embeddings improves retrieval performance, speed, and cost-effectiveness. The platform also extends its document-level summaries to audio and video, with dedicated methods for producing cohesive video summaries. For streaming and retrieval, Ragie's APIs return links for streaming or downloading specific segments, so applications can highlight the relevant parts of a media file. This unlocks new use cases across industries: users can search and analyze audio and video content efficiently and reach information that was previously untapped.
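
To make the chunking step described above more concrete, here is a minimal sketch of how timestamped transcript segments might be merged into retrieval chunks that carry start and end times plus a source link. It is an illustration under stated assumptions, not Ragie's actual implementation: the `Segment` and `Chunk` types, the `chunk_segments` function, the `max_chars` limit, and the `source_url` field are all hypothetical names. The `Segment` shape mirrors the start, end, and text fields that a transcription tool such as faster-whisper reports per segment.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """One transcribed span (hypothetical type; times are in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Chunk:
    """A retrieval unit with the metadata needed to link back to the media."""
    text: str
    start: float
    end: float
    source_url: str  # assumed: a link to the original media file


def _flush(buffer: List[Segment], source_url: str) -> Chunk:
    # The chunk spans from the first segment's start to the last segment's end.
    return Chunk(
        text=" ".join(seg.text for seg in buffer),
        start=buffer[0].start,
        end=buffer[-1].end,
        source_url=source_url,
    )


def chunk_segments(segments: Iterable[Segment], source_url: str,
                   max_chars: int = 200) -> List[Chunk]:
    """Greedily merge consecutive segments, keeping each chunk's text
    at or under max_chars (a single oversized segment becomes its own chunk)."""
    chunks: List[Chunk] = []
    buffer: List[Segment] = []
    size = 0  # length of the joined text currently held in buffer
    for seg in segments:
        projected = size + len(seg.text) + (1 if buffer else 0)
        if buffer and projected > max_chars:
            chunks.append(_flush(buffer, source_url))
            buffer = []
            projected = len(seg.text)
        buffer.append(seg)
        size = projected
    if buffer:
        chunks.append(_flush(buffer, source_url))
    return chunks


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "Welcome to the quarterly review."),
        Segment(2.5, 6.0, "Revenue grew in every region."),
        Segment(6.0, 9.0, "Next, the product roadmap."),
    ]
    for chunk in chunk_segments(demo, "https://example.com/call.mp3", max_chars=64):
        print(f"[{chunk.start:.1f}s to {chunk.end:.1f}s] {chunk.text}")
```

Because every chunk keeps its own time range, an application that retrieves a chunk can seek straight to the matching part of the recording instead of returning the whole file. A production pipeline would more likely split on semantic boundaries or token counts than on a fixed character limit; the character limit is used here only to keep the sketch short.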