Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
Summary
NVIDIA has released Nemotron 3 Nano Omni, an omni-modal model designed for integrated processing of text, image, video, and audio. The model is optimized for long-context workloads, including complex document analysis, automatic speech recognition (ASR), and agentic computer use within GUI environments.
Key Points
- The model is built on a hybrid Mamba-Transformer Mixture-of-Experts (MoE) backbone (30B-A3B) consisting of 23 Mamba selective state-space layers, 23 MoE layers (128 experts, top-6 routing), and 6 grouped-query attention layers; a toy routing sketch follows this list.
- Vision processing uses dynamic resolution at native aspect ratios, allocating between 1,024 and 13,312 visual patches per image to handle high-resolution documents and screenshots (a patch-budget sketch also follows the list).
- Audio capabilities include native processing of 16 kHz audio via the Parakeet-TDT-0.6B-v2 encoder, which accepts inputs of up to 1,200 seconds during training and over 5 hours of audio at the maximum context length.
- Video efficiency comes from a Conv3D tubelet embedding path that fuses consecutive frames and an Efficient Video Sampling (EVS) mechanism that prunes redundant static tokens (illustrated under Technical Details below).
- Benchmarks show the model achieving 65.8 on OCRBenchV2-En, 72.2 on Video-MME, and 89.4 on VoiceBench.
- The model delivers up to 9x higher throughput on multimodal workloads and 7.4x higher system efficiency on multi-document tasks than open-weights models of similar size.
- Checkpoints are available on Hugging Face in BF16, FP8, and NVFP4 formats.
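For intuition, the snippet below is a minimal top-k router of the kind a 128-expert, top-6 MoE layer implies. It is an illustrative PyTorch sketch, not NVIDIA's Megatron implementation; the expert width `d_ff` and the small demo sizes are arbitrary choices.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    """Toy top-k Mixture-of-Experts layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 128, top_k: int = 6):
        super().__init__()
        self.top_k = top_k
        self.router = torch.nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, d_ff),
                torch.nn.SiLU(),
                torch.nn.Linear(d_ff, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to its top-k experts,
        # whose outputs are mixed with softmax-normalized gate weights.
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=128, n_experts=8, top_k=2)  # demo-scale, not 128/top-6
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])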
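The dynamic-resolution patch budget can likewise be pictured with a small helper that fits a native-aspect-ratio patch grid into the quoted 1,024–13,312 range. The 16-pixel patch edge is an assumption; the release does not state the encoder's patch size.

```python
import math

PATCH = 16  # assumed patch edge in pixels; not confirmed by the release notes
MIN_PATCHES, MAX_PATCHES = 1_024, 13_312  # range quoted for Nemotron 3 Nano Omni

def patch_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a patch grid that preserves aspect ratio within the patch budget."""
    grid_w = max(1, round(width / PATCH))
    grid_h = max(1, round(height / PATCH))
    total = grid_w * grid_h
    # Scale the grid up or down uniformly until it fits the budget.
    if total < MIN_PATCHES or total > MAX_PATCHES:
        target = min(max(total, MIN_PATCHES), MAX_PATCHES)
        scale = math.sqrt(target / total)
        grid_w = max(1, round(grid_w * scale))
        grid_h = max(1, round(grid_h * scale))
    return grid_w, grid_h

# A 2480x3508 page (A4 at 300 dpi): the full 155x219 grid exceeds the budget,
# so it is scaled down to roughly 13k patches at the same aspect ratio.
print(patch_grid(2480, 3508))  # (97, 137)
```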
Technical Details
The architecture follows a unified encoder-projector-decoder design. The language backbone is paired with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, both connected to the LLM via lightweight 2-layer MLP projectors, so vision, audio, and text tokens can be interleaved in a shared embedding space. For video, the Conv3D path reduces the vision-token count by fusing pairs of consecutive frames, while EVS further speeds up inference by dropping tokens in regions of the video where no motion or change is detected, as sketched below.
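As a rough illustration of the EVS idea, the sketch below drops video tokens whose patch embedding is nearly identical to the token at the same spatial position in the previous frame. The cosine-similarity test and the 0.95 threshold are assumptions, not the published mechanism.

```python
import torch
import torch.nn.functional as F

def prune_static_tokens(frames: torch.Tensor, threshold: float = 0.95):
    """Keep a video token only if it differs enough from the token at the
    same spatial position in the previous frame (frame 0 is always kept).

    frames: (T, N, D) tensor — T frames, N patch tokens each, D-dim embeddings.
    """
    T, N, _ = frames.shape
    keep = torch.ones(T, N, dtype=torch.bool)
    # Cosine similarity of each token to its counterpart one frame earlier.
    sim = F.cosine_similarity(frames[1:], frames[:-1], dim=-1)  # (T-1, N)
    keep[1:] = sim < threshold  # near-identical ("static") tokens are dropped
    return frames[keep], keep

# Random embeddings are almost never similar, so most tokens survive here;
# a genuinely static screen recording would be pruned far more aggressively.
tokens, mask = prune_static_tokens(torch.randn(8, 256, 64))
print(tokens.shape, mask.float().mean().item())
```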
The training and optimization pipeline ran on NVIDIA H100 and B200 clusters using Megatron-LM, Transformer Engine, and Megatron Energon. Training proceeded through staged multimodal alignment, context extension, and reinforcement learning with NeMo-RL and NeMo Gym. The infrastructure supported tensor, expert, sequence, and context parallelism, alongside online sequence packing and selective activation recomputation; a toy packing example follows.
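Sequence packing itself is easy to picture: variable-length samples are binned into fixed context windows so that padding is minimized. The first-fit packer below conveys the idea only; Megatron Energon's actual online packer is considerably more sophisticated, and the 8,192-token window is an arbitrary example.

```python
def pack_sequences(lengths, max_len=8192):
    """Toy first-fit packer: group variable-length samples into bins so each
    packed sequence stays within the context window."""
    bins = []  # each bin: [remaining_capacity, [sample_indices]]
    for i, n in enumerate(lengths):
        for b in bins:
            if n <= b[0]:  # sample fits in an existing bin
                b[0] -= n
                b[1].append(i)
                break
        else:  # no bin had room; open a new one
            bins.append([max_len - n, [i]])
    return [b[1] for b in bins]

# Seven samples pack into three 8k windows instead of seven padded ones.
print(pack_sequences([5000, 3000, 7000, 1000, 200, 100, 50]))
# [[0, 1, 5, 6], [2, 3], [4]]
```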
Impact / Why It Matters
Developers can implement highly efficient, long-context agents capable of reasoning across interleaved audio-visual streams, such as analyzing narrated screen recordings or processing 100+ page technical documents. The availability of FP8 and NVFP4 weights enables high-throughput deployment on modern NVIDIA hardware for large-scale automated workflows.
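As a starting point, the BF16 checkpoint can presumably be pulled through the standard Hugging Face `transformers` flow. The repository id below is a placeholder, not a confirmed name; check the published model card for the exact id, the correct Auto class, and the recommended loading path. The FP8 and NVFP4 variants will generally need an inference runtime with native low-precision support (e.g., TensorRT-LLM or vLLM) rather than plain `transformers`.

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

# Hypothetical repo id — consult the actual Nemotron 3 Nano Omni model card.
repo = "nvidia/nemotron-3-nano-omni"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # BF16 checkpoint; FP8/NVFP4 need a low-precision runtime
    device_map="auto",
    trust_remote_code=True,
)
```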