
NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost

Blog post from Deepinfra

Post Details
Company: Deepinfra
Author: Deep
Word Count: 1,697
Language: English
Hacker News Points: -
Summary

NVIDIA's Nemotron 3 Super 120B is a large language model released in 2026. It has 120 billion total parameters, of which only 12 billion are active per inference pass, which keeps it efficient in complex applications such as software development and cybersecurity. The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction, achieves over five times the throughput of its predecessor, and supports a 1 million token context window.

Among the API providers analyzed for Nemotron 3 Super, DeepInfra is the most cost-effective choice at $0.20 per million tokens, while still delivering strong throughput (459.3 tokens/sec) and competitive latency (1.01 seconds). Baseten is the better fit for latency-sensitive applications and Lightning AI leads on throughput, but DeepInfra is recommended for production-scale deployments because it balances performance with low cost.
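
To make the quoted DeepInfra figures more concrete, here is a minimal Python sketch that turns them into rough cost and response-time estimates. It rests on several assumptions that the summary does not confirm: that the $0.20 price applies uniformly to input and output tokens, that the 1.01 second latency is the delay before the first token arrives, and that the 459.3 tokens/sec throughput is the output generation rate for a single request. The function names and the example workload are hypothetical.

```python
"""Back-of-the-envelope estimates from the DeepInfra figures quoted in the summary."""

# Figures quoted in the summary for DeepInfra.
PRICE_USD_PER_MILLION_TOKENS = 0.20
THROUGHPUT_TOKENS_PER_SEC = 459.3
LATENCY_SEC = 1.01


def estimate_cost_usd(total_tokens: int) -> float:
    """Estimate cost, assuming one flat price for input and output tokens."""
    return total_tokens / 1_000_000 * PRICE_USD_PER_MILLION_TOKENS


def estimate_response_time_sec(output_tokens: int) -> float:
    """Estimate wall-clock time for one response.

    Assumes the quoted latency is the wait before the first token and that
    the remaining tokens stream at the quoted throughput.
    """
    return LATENCY_SEC + output_tokens / THROUGHPUT_TOKENS_PER_SEC


if __name__ == "__main__":
    # Hypothetical workload, chosen only for illustration.
    monthly_tokens = 50_000_000
    output_tokens = 500

    print(f"Estimated monthly cost: ${estimate_cost_usd(monthly_tokens):.2f}")
    print(f"Estimated time per response: {estimate_response_time_sec(output_tokens):.2f} s")
```

Under these assumptions the fixed latency dominates short responses, while throughput matters more as responses grow longer. That is consistent with the summary's split between a latency-oriented provider and a throughput-oriented one.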