NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost

Post Details

Company

DeepInfra

Date Published

April 3, 2026

Author

Deep

Word Count

1,697

Company Posts That Month

34

Language

English

Hacker News Points

-

Post removed?

No

Source URL

deepinfra.com/blog/nvidia-nemotron-3-super-120b-api-benchmarks

Summary

NVIDIA's Nemotron 3 Super 120B is a large language model released in 2026. It has 120 billion parameters, of which only 12 billion are active per inference pass, which improves efficiency in complex applications such as software development and cybersecurity. It uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction, achieves more than five times the throughput of its predecessor, and supports a 1 million token context window. The post's comparison of API providers for Nemotron 3 Super identifies DeepInfra as the most cost-effective choice, at $0.20 per million tokens with strong throughput (459.3 tokens/sec) and competitive latency (1.01 seconds). Baseten is presented as the best fit for latency-sensitive applications and Lightning AI as the throughput leader, while DeepInfra is recommended for production-scale deployments on the strength of its balanced performance and low cost.
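To make the DeepInfra figures quoted in the summary concrete, the following minimal Python sketch turns them into rough per-request estimates. It rests on several assumptions that the summary does not confirm: that the $0.20 price applies uniformly to every token (input and output alike), that the 1.01 second latency is the delay before output begins, and that the 459.3 tokens/sec throughput applies to output generation. The function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the summary for DeepInfra.
PRICE_PER_MILLION_TOKENS_USD = 0.20   # assumed to apply to all tokens alike
THROUGHPUT_TOKENS_PER_SEC = 459.3     # assumed to be output generation speed
LATENCY_SEC = 1.01                    # assumed to be the delay before output starts


def estimate_cost_usd(total_tokens: int) -> float:
    """Rough cost of processing `total_tokens` at the quoted flat price."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def estimate_response_time_sec(output_tokens: int) -> float:
    """Rough wall-clock time: initial latency plus generation time."""
    return LATENCY_SEC + output_tokens / THROUGHPUT_TOKENS_PER_SEC


if __name__ == "__main__":
    # Example: a hypothetical request that processes 50,000 tokens in total
    # and generates 2,000 output tokens.
    print(f"cost: ${estimate_cost_usd(50_000):.4f}")
    print(f"time: {estimate_response_time_sec(2_000):.2f} s")
```

Under these assumptions, cost scales linearly with token volume and response time is dominated by generation once outputs exceed a few hundred tokens, which is why both price and throughput matter for production-scale workloads.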

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	4	5,932	1,046	223	-2%
Real-time	3	6,296	1,346	246	-2%
AI Agents	1	4,430	1,100	236	-3%
Multi-agent systems	1	460	170	68	-20%
RAG	1	941	216	85	-48%
Vector Search	1	1,739	413	146	-27%
Voice AI	1	2,379	221	38	-3%
