Content Deep Dive

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

Blog post from HuggingFace

Post Details
Company
HuggingFace
Date Published
-
Author
Tuomas Rintamaki, Amala Sanjay Deshmukh, Nabin Mulepati, Collin McCarthy, Pritam Biswas, Arushi Goel, Leili Tavabi, Alexandre Milesi, Danial Mohseni Taheri, Kateryna Chumachenko, Isabel Hulseman, Zhehuai Chen, Karan, and Tao
Word Count
3,186
Language
-
Hacker News Points
-
Summary

NVIDIA's Nemotron 3 Nano Omni is a multimodal understanding model designed for real-world document analysis, automatic speech recognition, and long audio-video understanding. It extends the Nemotron multimodal line by integrating text, image, video, and audio processing, achieving strong accuracy on document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, as well as video and audio leaderboards such as WorldSense and DailyOmni. Its architecture pairs a hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, letting it handle dense images, documents, and mixed-modality reasoning efficiently. Nemotron 3 Nano Omni is trained with staged multimodal alignment, context extension, and reinforcement learning, and delivers up to 9x higher throughput and 2.9x faster reasoning speed compared to alternatives. Its applications span real-world document analysis, agentic computer use, and general multimodal reasoning, making it a versatile tool for complex tasks that combine visual, auditory, and textual data.
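Since the model accepts text, image, and audio inputs in a single request, a mixed-modality prompt can be illustrated with the content-list convention used by Hugging Face chat templates. This is a minimal sketch, not taken from the post: the helper name, the file paths, and the exact content `type` keys are assumptions, and a real deployment would pass the resulting messages to the model's processor.

```python
# Sketch of assembling one mixed-modality user turn (text + image + audio)
# in the content-list style used by Hugging Face chat templates.
# build_omni_message and the file paths are hypothetical examples.

def build_omni_message(text, image_path=None, audio_path=None):
    """Assemble a single user turn mixing optional image/audio parts with text."""
    content = []
    if image_path:
        content.append({"type": "image", "path": image_path})
    if audio_path:
        content.append({"type": "audio", "path": audio_path})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

messages = [
    build_omni_message(
        "Summarize this document page and the accompanying narration.",
        image_path="report_page1.png",
        audio_path="narration.wav",
    )
]
```

In a typical transformers-style pipeline, such a `messages` list would then be rendered and tokenized with the model's processor (e.g. via a chat template) before generation; the structure above simply shows how the three modalities named in the post could coexist in one request.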