Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
Blog post from Hugging Face
NVIDIA's Nemotron 3 Nano Omni is a cutting-edge multimodal understanding model designed for real-world document analysis, automatic speech recognition, and long audio-video understanding. It extends the Nemotron multimodal line by integrating text, image, video, and audio processing, and achieves strong accuracy on document-intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, as well as video and audio leaderboards such as WorldSense and DailyOmni.

Architecturally, the model pairs a hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, allowing it to process dense images, long documents, and mixed-modality inputs efficiently. Training proceeds through staged multimodal alignment, context extension, and reinforcement learning, and the model delivers up to 9x higher throughput and 2.9x faster reasoning speed than comparable models.

Its applications span real-world document analysis, agentic computer use, and general multimodal reasoning, making it a versatile tool for complex tasks that require combining visual, auditory, and textual data.
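To make the mixed-modality workflow concrete, here is a minimal sketch of how a conversation combining a document image, an audio clip, and a text question could be assembled for a transformers-style chat template. The typed-content message schema follows the convention used by multimodal chat templates in the Hugging Face `transformers` library; the repository id, file names, and the exact field names accepted by this particular model are assumptions, not a confirmed API for Nemotron 3 Nano Omni.

```python
# Hedged sketch: building a mixed-modality chat turn (image + audio + text).
# The role/content-parts schema below mirrors the multimodal chat-template
# convention in Hugging Face transformers; field names for this specific
# model are an assumption.

def build_messages(question: str, image_path: str, audio_path: str) -> list[dict]:
    """Build a single user turn mixing an image, an audio clip, and text."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},  # e.g. a scanned document page
                {"type": "audio", "audio": audio_path},  # e.g. a meeting recording
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_messages(
    "Summarize the document and the audio, and note any points of disagreement.",
    "report_page1.png",
    "meeting_clip.wav",
)

# With the checkpoint available locally, inference would then follow the usual
# transformers pattern (the repo id below is hypothetical):
#
# from transformers import AutoProcessor, AutoModelForCausalLM
# processor = AutoProcessor.from_pretrained("nvidia/<nemotron-3-nano-omni>",
#                                           trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained("nvidia/<nemotron-3-nano-omni>",
#                                              trust_remote_code=True,
#                                              device_map="auto")
# inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
#                                        tokenize=True, return_dict=True,
#                                        return_tensors="pt")
# out = model.generate(**inputs, max_new_tokens=256)
# print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

The commented portion is left unexecuted because it requires downloading the checkpoint; the message-building step above it runs standalone and shows the structure the processor would consume.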