NVIDIA Nemotron-3 Super: a 120B MoE model that runs on a single GPU
- Bastien
- 20 Mar, 2026
On March 11, 2026, NVIDIA released Nemotron-3 Super — a model that sits at an unusual intersection: 120 billion total parameters, only 12 billion active during inference, deployable on a single GPU, and capable of holding a one-million-token context. It is the first model in the Nemotron 3 family trained at NVFP4 precision, and it introduces a hybrid architecture that combines Mamba-2 state space models, Mixture of Experts routing, and standard attention layers in a single system. Here is a thorough look at what this release actually delivers.
The architecture: why LatentMoE is different
The model’s architecture is described as LatentMoE — a variant of Mixture of Experts where token representations are first projected into a smaller latent dimension before being routed to the experts. This is not a cosmetic variation. Routing in latent space rather than full embedding space reduces the computational cost of the routing decision itself, and it stabilises gradient flow during training by separating the routing signal from the full representational space.
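As a rough illustration, a latent-space router can be sketched in a few lines of numpy. The dimensions and the top-k value below are invented for the example, not the model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 4096, 512, 64, 4  # illustrative sizes

# Routing logits are computed from a small latent projection of the token,
# not from the full hidden state, so the routing decision costs
# O(d_latent * n_experts) instead of O(d_model * n_experts).
W_latent = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)

def route(h):
    """Select top-k experts for one token via its latent projection."""
    z = h @ W_latent                       # (d_latent,) latent routing space
    logits = z @ W_router                  # (n_experts,) expert scores
    top = np.argsort(logits)[-top_k:]      # ids of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                # expert ids + normalized gate weights

experts, gates = route(rng.standard_normal(d_model))
```

Routing in a 512-dimensional latent space instead of the full 4096-dimensional embedding cuts the router's matmul cost by 8x in this sketch; the training-stability benefit comes from the same separation.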
Beyond the routing design, Nemotron-3 Super is a hybrid model: it combines Mamba-2 state space layers, MoE feedforward layers, and standard attention layers in the same architecture. Each of these mechanisms has distinct strengths.
- Mamba-2 layers handle long sequences efficiently with linear-time complexity, avoiding the quadratic cost of attention over very long contexts.
- MoE feedforward layers concentrate capacity in a small number of active experts (12B out of 120B total), keeping inference cheap.
- Standard attention layers handle precise positional reasoning and short-range dependencies where attention is still the most reliable mechanism.
The combination is designed to reach 1 million tokens of effective context without the compute explosion that pure attention would require. At 128K tokens, the model scores 95.99 on RULER; at 512K tokens, it still holds 96.23. These are not headline numbers crafted for a press release — they reflect genuine architectural capability.
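A back-of-the-envelope comparison makes the complexity argument concrete. At a 1M-token context, a pure attention layer must score every token pair, while a Mamba-2 layer performs one state update per token:

```python
n = 1_000_000                 # context length in tokens

attn_pairs = n * n            # quadratic: ~1e12 score entries per head per layer
mamba_steps = n               # linear: one recurrent state update per token

ratio = attn_pairs // mamba_steps
print(ratio)                  # pure attention does a million times more pairwise work
```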
NVFP4: training in 4-bit precision
The NVFP4 quantization format is the other major architectural story. The majority of linear layers — weights, activations, and gradients — are trained at NVFP4 precision. Select layers that are particularly sensitive to numerical precision (latent projections, multi-token prediction layers, attention projections, embeddings) remain in BF16 or MXFP8.
This is a carefully designed hybrid precision strategy. Full 4-bit training typically introduces instability; the approach here is to identify which operations are precision-sensitive and retain higher precision only there, while maximising NVFP4 use elsewhere. The result is a model that fits in substantially less memory than a BF16 equivalent at 120B parameters — and runs on a single B200 GPU or a single DGX Spark, hardware configurations that would be impossible for a comparable dense model.
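The selection logic described above can be sketched as a simple per-layer rule. The keyword patterns below are a guess at how the sensitive layers might be identified; the model's actual module names are not public in this form:

```python
# Precision-sensitive layers (per the model card: latent projections, MTP,
# attention projections, embeddings) stay in higher precision; everything
# else trains in NVFP4. The substring patterns here are illustrative.
SENSITIVE = ("latent_proj", "mtp", "attn_proj", "embed")

def pick_precision(layer_name: str) -> str:
    if any(key in layer_name for key in SENSITIVE):
        return "bf16"    # retain higher precision where 4-bit is unstable
    return "nvfp4"       # default: 4-bit weights, activations, gradients

assert pick_precision("block7.moe.expert3.ffn") == "nvfp4"
assert pick_precision("block7.attn_proj.q") == "bf16"
```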
For context: running Mistral Small 4 (also a MoE model, with comparable active parameters) requires a minimum of two H200 GPUs. Nemotron-3 Super with its NVFP4 quantization reaches single-GPU deployment for workloads that previously required multiple GPUs.
Multi-Token Prediction: faster inference through architecture
Nemotron-3 Super includes Multi-Token Prediction (MTP) layers with a shared-weight design. MTP is an architectural choice that predicts multiple future tokens per forward pass rather than one at a time, using shared weights across prediction heads.
The practical effects are twofold. During training, predicting multiple tokens simultaneously provides a richer gradient signal — the model learns from a denser feedback loop rather than a single next-token objective. During inference, MTP enables native speculative decoding, which pre-generates candidate continuations that can be validated in parallel rather than sequentially. This produces measurably faster inference without changing the output distribution.
The shared-weight design keeps the parameter overhead of MTP negligible, preserving the 12B active parameter budget during inference.
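A minimal sketch of the shared-weight idea, assuming one output projection reused across prediction positions with only a small per-position term distinguishing them (the real head design is not specified in this detail):

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, horizon = 256, 1000, 4   # horizon = future tokens per forward pass

W_shared = rng.standard_normal((d, vocab)) * 0.02        # one head, reused
pos_bias = rng.standard_normal((horizon, vocab)) * 0.01  # tiny per-position term

def mtp_logits(h):
    """Predict logits for `horizon` future tokens from one hidden state."""
    return np.stack([h @ W_shared + pos_bias[t] for t in range(horizon)])

logits = mtp_logits(rng.standard_normal(d))   # shape: (horizon, vocab)
```

Because `W_shared` is reused, the extra parameters scale with `horizon * vocab` rather than `horizon * d * vocab` in this sketch, which is why the overhead stays negligible.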
Configurable reasoning: on, off, low-effort, or budgeted
One of the more operationally useful features of Nemotron-3 Super is the ability to control reasoning depth at inference time through a chat template parameter (enable_thinking).
Full reasoning mode (default): the model generates an explicit chain of thought before producing its final answer. This is the appropriate setting for complex mathematical, scientific, or logical tasks where structured analysis reduces hallucination risk.
No reasoning mode: the model bypasses the reasoning trace and responds directly. This is the right choice for simple queries, conversational interactions, or any setting where response latency is more important than depth.
Low-effort reasoning: a middle ground where reasoning is enabled but constrained. The model generates a shorter reasoning trace, which reduces token consumption and latency while retaining some analytical benefit for moderately complex queries.
Budget-controlled reasoning: a more surgical control mechanism that sets a hard token ceiling on the reasoning trace. The model generates reasoning up to the budget, then closes the trace gracefully and produces the final answer. This is useful when you have a latency budget in milliseconds and want to extract the maximum reasoning quality within it.
This configurability is significant for production deployments. A single Nemotron-3 Super instance can handle both simple FAQ-style requests and complex document analysis tasks, dynamically adjusting compute expenditure per query rather than requiring separate model deployments for different task types.
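As a sketch, the four modes might map onto request parameters like this. `enable_thinking` comes from the model card; the low-effort and budget parameter names are assumptions, and `chat_template_kwargs` is the field vLLM-style servers use to forward template arguments:

```python
def reasoning_config(mode, budget=None):
    """Build chat-template kwargs for a given reasoning mode (sketch)."""
    if mode == "off":
        return {"enable_thinking": False}
    cfg = {"enable_thinking": True}           # full reasoning is the default
    if mode == "low":
        cfg["reasoning_effort"] = "low"       # assumed name for low-effort mode
    elif mode == "budget":
        cfg["max_thinking_tokens"] = budget   # assumed name for the token ceiling
    return cfg

request = {
    "model": "nvidia/Nemotron-3-Super",       # illustrative model id
    "messages": [{"role": "user", "content": "Summarise this contract clause."}],
    "chat_template_kwargs": reasoning_config("budget", budget=512),
}
```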
Benchmarks: where it performs and where it does not
NVIDIA reports benchmark results with transparency about methodology, including links to open-source evaluation tooling and reproducibility documentation.
On general knowledge, the model scores 83.33 on MMLU-Pro, positioning it competitively against frontier models in that benchmark category.
On reasoning tasks, the results are more varied. GPQA (graduate-level scientific questions, no tools) reaches 79.42. LiveCodeBench v5 scores 80.56. HMMT Feb25 (a hard competition math benchmark, with tools enabled) reaches 95.36. SciCode (a scientific coding benchmark, subtask evaluation) scores 40.83 — a number worth noting as a signal that scientific coding remains harder than general code generation for all current models.
On long-context tasks, the RULER benchmark scores (95.99 at 128K, 96.23 at 512K) are notably stable across context lengths, suggesting the hybrid architecture is genuinely handling long sequences rather than degrading.
On agentic tasks (Terminal Bench, hard subset), the score of 24.48 reflects a genuine frontier challenge: autonomous terminal operation is difficult for all current models. For enterprise agentic workflows (TauBench V2), the average of 60.46 across airline, retail, and telecom domains positions the model as a practical tool for structured task automation.
The HLE score of 17.42 (Humanity’s Last Exam, no tools) deserves mention: HLE is explicitly designed to probe the limits of current AI systems with problems at or beyond human expert level. A score of 17 is low, but it is in the same range as other state-of-the-art models on this benchmark.
Training at scale: 25 trillion tokens and three stages
The training process follows three explicit stages.
Stage 1 — pre-training: the base model was trained on more than 25 trillion tokens, drawing from web crawls (English and multilingual Common Crawl), code repositories (GitHub crawl: 747.4 billion tokens), scientific literature (arXiv, PubMed, BioRxiv), and a substantial volume of synthetic data generated using models including Qwen3-235B, DeepSeek-R1, and DeepSeek-V3. The total training corpus spans 153 datasets collected from 2013 through February 2026.
Stage 2 — supervised fine-tuning: the focus shifts to code, mathematics, science, tool calling, instruction following, and structured output generation. Special datasets were created for long-range retrieval and multi-document aggregation tasks. The Data Designer library was used for synthetic data generation at this stage.
Stage 3 — reinforcement learning: NVIDIA used asynchronous GRPO (Group Relative Policy Optimization) with fully decoupled training and inference across separate GPU clusters. In-flight weight updates and MTP acceleration were applied during RL training. A subsequent RLHF pass refined conversational quality. The training infrastructure (NeMo RL and NeMo Gym) is open-sourced, which is relevant for teams seeking to adapt the training recipe.
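The core of GRPO is easy to state: sample a group of completions per prompt, then score each one against its own group's statistics instead of a learned value function. A minimal sketch of that advantage computation:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalise each completion's reward by
    the mean and std of its own sampling group (no critic network)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for the same prompt, scored by a reward model.
# Above-average completions get positive advantage, below-average negative.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```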
Hardware requirements: single-GPU is real, but conditional
The headline claim is single-GPU deployment on a B200 or DGX Spark. This is accurate for the NVFP4 model, which fits in less memory than a BF16 equivalent. For teams running on H100-80GB hardware, the model is supported but may require multi-GPU deployment depending on context length configuration.
For 1 million token context specifically, additional memory headroom is required. The NVFP4 format enables it on a single B200, but the configuration requires explicit flags (VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 or SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1).
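In practice that means setting the override before the engine starts. A sketch for vLLM (the flag is from the model card; actual feasibility of the 1M window depends on available GPU memory, and the model id is illustrative):

```python
import os

# Must be set before the vLLM engine is constructed.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

# Requires B200-class memory headroom for the full 1M-token window:
# from vllm import LLM
# llm = LLM(model="nvidia/Nemotron-3-Super", max_model_len=1_000_000)
```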
For most enterprise deployments that do not require 1M context, standard H100-80GB configurations are sufficient. For independent developers or small teams without dedicated GPU infrastructure, NVIDIA Build provides free prototyping access.
Multilingual and programming language coverage
The model supports seven natural languages for both input and output: English, French, German, Italian, Japanese, Spanish, and Chinese. Post-training data is heavily skewed toward English (13.48M samples vs. 53K for each other language), which typically means English output quality exceeds that of the other supported languages. Competitive performance in French, German, and Spanish is nonetheless a stated design goal, supported by 43.2K translation pairs per language pair in the post-training mix.
For programming languages, 43 languages are covered. The training corpus includes 1.09 trillion tokens of curated GitHub code plus 427.9 billion tokens of Common Crawl-sourced code — a substantial base for a wide range of software engineering use cases.
Licensing: NVIDIA’s open model license
Nemotron-3 Super is released under the NVIDIA Nemotron Open Model License, which permits commercial use. This is a more permissive position than a purely proprietary model, but it differs meaningfully from Apache 2.0 (the license used by Mistral Small 4). The NVIDIA license includes terms that restrict certain uses and requires attribution; detailed review is advisable before building revenue-generating products on the model.
The NIM container version (for production deployment via NVIDIA’s inference infrastructure) is governed by separate NVIDIA Software License Agreement terms. Teams that need the simplest possible legal situation should review both license variants before committing to a deployment path.
Deployment: vLLM and SGLang, with custom reasoning parsers
The model is deployable via vLLM and SGLang, the two dominant open-source serving frameworks. Both require custom reasoning parsers (super_v3_reasoning_parser.py for vLLM, nano_v3 for SGLang) to handle the model’s structured reasoning trace format.
Recommended inference parameters are temperature=1.0 and top_p=0.95 across all task types. That is somewhat unusual (most models recommend lower temperatures for deterministic tasks) and reflects the model's calibration toward sampling-based inference even in precise domains like code and mathematics.
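A minimal request against an OpenAI-compatible endpoint using those defaults. The server address and model id are illustrative, and the server is assumed to have been launched separately with the custom reasoning parser from the model card:

```python
import json
import urllib.request

body = {
    "model": "nvidia/Nemotron-3-Super",
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "temperature": 1.0,   # recommended even for code and math tasks
    "top_p": 0.95,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # uncomment with a live server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```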
NIM container deployment provides hardware-optimised inference for teams already operating within NVIDIA’s production infrastructure, with performance tuning applied out of the box.
What this means for the open model landscape
Nemotron-3 Super arrives at a moment when the distinction between “open” and “proprietary” is increasingly defined by inference cost and hardware accessibility rather than parameter count alone. A 120B-parameter model that runs on a single B200 GPU, maintains competitive benchmark performance, supports 1M-token context, and allows commercial use is a meaningful expansion of what is accessible to mid-sized organisations.
The NVFP4 training methodology is likely to influence future model releases — it demonstrates that training at 4-bit precision is viable at scale when combined with selective higher-precision layers, and it achieves single-GPU deployment without the architectural compromises that purely smaller models require.
Open questions remain: real-world performance on heterogeneous enterprise data tends to diverge from benchmark conditions, and the complexity of configuring the reasoning parser and deployment flags correctly is non-trivial. The model’s post-training quality in non-English languages beyond French and Spanish will need empirical validation in practice.
For teams evaluating AI infrastructure in 2026, Nemotron-3 Super adds a credible data point to the case that frontier-adjacent capability no longer requires frontier-level hardware investment.
Official source: NVIDIA Nemotron-3 Super on Hugging Face
Tags:
- NVIDIA
- LLM
- AI
- Nemotron
- Mixture of Experts
- Reasoning
- On-device AI