Nemotron Cascade 2: NVIDIA's 30B model that won the math and coding Olympics
- Bastien
- 24 Mar, 2026
What is Nemotron Cascade 2?
Nemotron Cascade 2 (30B-A3B) is an open model released by NVIDIA on March 19, 2026. Its headline number is deceptive: 30 billion total parameters, but only 3 billion activated per inference pass. This is a Mixture-of-Experts (MoE) architecture at work: the model routes each token through a small subset of its capacity, making it dramatically more efficient than a dense 30B model.
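To see what "routes each token through a small subset of its capacity" means, here is a minimal top-k gating sketch in plain Python. The expert count and k are invented for illustration; this shows the general MoE routing idea, not NVIDIA's actual router.

```python
import math
import random

def top_k_route(gate_logits, k=2):
    """Pick the k experts with the highest gate scores for one token."""
    # softmax over the gate logits -> routing probabilities
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # only the top-k experts run for this token; the rest stay idle
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs

random.seed(0)
gate_logits = [random.gauss(0, 1) for _ in range(8)]  # 8 experts, illustrative
active, probs = top_k_route(gate_logits, k=2)
print(f"experts {active} handle this token ({len(active)}/{len(probs)} active)")
```

Only the selected experts' weights participate in the forward pass for that token, which is why activated parameters, not total parameters, set the per-token compute cost.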
It supports two modes: thinking (extended chain-of-thought for hard problems) and instruct (fast, direct responses). On hard reasoning tasks, the thinking mode delivers results that are difficult to believe from a sub-frontier model.
Architecture and training
The training pipeline combines two techniques:
- Cascade RL — a reinforcement learning approach that progressively challenges the model with harder problems as it improves
- Multi-Domain On-Policy Distillation — the model generates its own training data under RL supervision, across mathematics, code, science, and instruction-following
The result is a model that has genuinely internalized structured problem-solving, not just pattern-matched against training examples.
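NVIDIA has not published the exact training loop, but the progressive-difficulty idea behind Cascade RL can be sketched as a toy curriculum: promote the model to harder problems once its success rate clears a threshold. Everything below (batch size, thresholds, the scalar "skill" standing in for the policy improving under RL) is invented for illustration.

```python
import random

def train_cascade(n_rounds=200, promote_at=0.8, seed=0):
    """Toy curriculum loop: raise problem difficulty as the model improves."""
    rng = random.Random(seed)
    difficulty, skill = 1, 1.0
    for _ in range(n_rounds):
        # sample a batch at the current difficulty and measure success;
        # a problem is solved with probability that grows with skill
        wins = sum(rng.random() < min(1.0, skill / difficulty) for _ in range(32))
        rate = wins / 32
        skill += 0.05 * rate      # reward signal: improve in proportion to success
        if rate >= promote_at:    # curriculum step: move to harder problems
            difficulty += 1
    return difficulty, skill

difficulty, skill = train_cascade()
print(f"reached difficulty {difficulty} with skill {skill:.2f}")
```

The key property is the feedback loop: the problem distribution is a function of current performance, so the model spends most of its training budget near the edge of its ability.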
Gold medals
This is the headline achievement. At the 2025 International Mathematical Olympiad and International Olympiad in Informatics, Nemotron Cascade 2 scored at gold medal level — competing against the best human students in the world.
These aren’t just benchmark numbers — IMO and IOI are the most prestigious mathematics and programming competitions there are, each held annually for the strongest pre-university competitors from around the world. A 30B open model reaching gold medal level is a meaningful milestone.
Full benchmark results
Mathematics
| Benchmark | Score |
|---|---|
| IMO 2025 | 35 pts (gold) |
| AIME 2025 | 92.4 (98.6 with TIR) |
| AIME 2026 | 90.9 (95.0 with TIR) |
| HMMT Feb 2025 | 94.6 |
| IMO AnswerBench | 79.3 |
Code & competitive programming
| Benchmark | Score |
|---|---|
| IOI 2025 | 439.3 pts (gold) |
| ICPC World Finals 2025 | 10/12 problems solved |
| LiveCodeBench v6 | 87.2 (88.4 with TIR) |
| SWE-bench Verified (OpenHands) | 50.2 |
Knowledge & science
| Benchmark | Score |
|---|---|
| GPQA-Diamond | 76.1 |
| MMLU-Pro | 79.8 |
| MMLU-Redux | 86.3 |
Instruction following & alignment
| Benchmark | Score |
|---|---|
| ArenaHard v2 (avg.) | 83.5 |
| ArenaHard hard prompts | 88.2 |
| IFBench | 82.9 |
Context length
| Benchmark | Score |
|---|---|
| NIAH @ 1M tokens | 99.0 |
| LongBench v2 | 40.3 |
The NIAH (Needle In A Haystack) score of 99.0 at 1 million tokens is particularly notable — the model reliably finds information buried in a 1M-token context.
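The NIAH evaluation itself is easy to picture: plant one distinctive sentence somewhere in a long span of filler text, then ask the model to retrieve it. A minimal prompt builder, with the filler and the needle invented for the example:

```python
import random

def build_haystack(needle, n_filler=1000, seed=42):
    """Bury one 'needle' sentence at a random position in filler text,
    then ask the model to retrieve it -- the essence of the NIAH test."""
    rng = random.Random(seed)
    filler = ["The sky is blue and the grass is green."] * n_filler
    pos = rng.randrange(n_filler)
    filler.insert(pos, needle)
    context = " ".join(filler)
    question = "What is the secret passphrase mentioned in the text?"
    return f"{context}\n\nQuestion: {question}", pos

needle = "The secret passphrase is 'moonlight-cascade'."
prompt, pos = build_haystack(needle)
print(f"needle planted at filler position {pos} of 1000")
```

A 99.0 score means the model almost always answers correctly regardless of where in the 1M-token context the needle lands.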
Efficiency: 3B activated out of 30B
The MoE architecture is the key to making this model practical. At inference time, only 3B parameters fire per token. This means:
| Metric | Value |
|---|---|
| Total parameters | 30B |
| Activated per token | 3B (10%) |
| Context window | 262,144 tokens |
| Tensor type | BF16 / F32 |
| Minimum setup | Single high-end GPU |
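A quick back-of-envelope shows why the split matters: weight memory is driven by total parameters (every expert must be resident), while per-token compute scales with activated parameters. The 2-FLOPs-per-active-parameter-per-token figure below is a standard rule of thumb, not an NVIDIA number.

```python
TOTAL_PARAMS = 30e9   # all experts; every one must be resident in memory
ACTIVE_PARAMS = 3e9   # routed subset that actually computes per token
BF16_BYTES = 2

# Weight memory scales with TOTAL parameters...
weight_gb = TOTAL_PARAMS * BF16_BYTES / 1e9
# ...but per-token compute scales with ACTIVATED parameters
# (~2 FLOPs per active parameter per token, a standard estimate)
flops_per_token = 2 * ACTIVE_PARAMS

print(f"{weight_gb:.0f} GB of BF16 weights, "
      f"{flops_per_token / 1e9:.0f} GFLOPs per token, "
      f"{ACTIVE_PARAMS / TOTAL_PARAMS:.0%} of parameters active")
```

So the ~60 GB of BF16 weights fits on a single 80 GB-class GPU, while each token costs roughly what a dense 3B model would.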
You can serve this with vLLM on a single GPU using `--tensor-parallel-size 1`; no multi-GPU setup is required for standard use.
Dual mode operation
The model is controlled via the chat template rather than separate model weights.
Thinking mode — activates the <think> reasoning trace before answering:
```python
from transformers import AutoTokenizer

# tokenizer and messages are set up once and reused for both modes
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Cascade-2-30B-A3B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # prompt ends with an open <think> block
)
```
Instruct mode — skips the reasoning trace for fast responses:
```python
# same tokenizer and messages as above
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # inserts an empty <think></think> block
)
```
Recommended sampling: `temperature=1.0`, `top_p=0.95`.
Agentic and tool use
The model natively supports Tool-Integrated Reasoning (TIR) — it can call Python code execution mid-reasoning and incorporate the result before producing its final answer. This is what drives the +TIR improvements in the benchmark scores above.
Tool calls use this format:
```
<tool_call>
<function=stateful_python_code_exec>
<parameter=code>import sympy; sympy.solve(...)</parameter>
</function>
</tool_call>
```
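On the consuming side, a harness has to pull the code out of that wrapper before executing it. A regex-based sketch, under the assumption that the format is exactly as shown above (the official parser may differ):

```python
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*"
    r"<function=(?P<name>[\w.]+)>\s*"
    r"<parameter=code>(?P<code>.*?)</parameter>\s*"
    r"</function>\s*"
    r"</tool_call>",
    re.DOTALL,
)

def extract_tool_calls(text):
    """Pull (function_name, code) pairs out of a model response."""
    return [(m.group("name"), m.group("code")) for m in TOOL_CALL_RE.finditer(text)]

response = (
    "Let me verify with sympy.\n"
    "<tool_call>\n<function=stateful_python_code_exec>\n"
    "<parameter=code>import sympy; print(sympy.isprime(97))</parameter>\n"
    "</function>\n</tool_call>"
)
print(extract_tool_calls(response))
```

The harness would then run the extracted code in a persistent interpreter and feed the output back to the model, which resumes its reasoning with the result in hand.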
For agentic coding, the model integrates with OpenHands (50.2 on SWE-bench Verified). OpenCode is not currently supported.
Use cases
Best for:
- Competitive mathematics and formal proofs
- Hard coding problems (competitive programming level)
- Long-context document analysis (up to 262K tokens)
- Agentic coding workflows via OpenHands
- Scientific reasoning (GPQA-Diamond: 76.1)
Not recommended for:
- Real-time fact retrieval (no web access)
- Deployments requiring OpenCode integration
- Memory-constrained environments without GPU
Limitations
- No OpenCode support — only OpenHands for agentic coding tasks
- Context compression in multi-turn thinking — only the summary, not the full `<think>` trace, is retained in conversation history
- Tool response format is non-standard — tool results go under the `user` role wrapped in `<tool_response>` tags, not a separate `tool` role
- License is the NVIDIA Open Model License, not Apache 2.0 — check terms for commercial use
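The non-standard tool-response convention from the list above can be made concrete as a message list. The conversation content here is invented; the point is the role layout: the tool result rides in a `user` turn inside `<tool_response>` tags rather than in a dedicated `tool` role.

```python
# Illustrative conversation: note that the tool output comes back under
# the "user" role, wrapped in <tool_response> tags -- no "tool" role exists.
messages = [
    {"role": "user", "content": "Is 97 prime? Use Python to check."},
    {"role": "assistant", "content": (
        "<tool_call>\n<function=stateful_python_code_exec>\n"
        "<parameter=code>import sympy; print(sympy.isprime(97))</parameter>\n"
        "</function>\n</tool_call>"
    )},
    # Tool result, fed back as a user turn:
    {"role": "user", "content": "<tool_response>\nTrue\n</tool_response>"},
]
roles = [m["role"] for m in messages]
print(roles)
```

Any harness built on a standard OpenAI-style `tool` role will need this small adaptation before its transcripts match what the chat template expects.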
Conclusion
Nemotron Cascade 2 redraws what’s possible with an efficient open model. A 3B-activated MoE winning gold at IMO and IOI is a genuine inflection point — not a benchmark cherry-pick, but a performance on the hardest public competitions that exist for mathematics and programming.
For researchers, engineers, and anyone building reasoning-heavy applications locally, this is the most capable open model in its weight class as of early 2026.
Model: nvidia/Nemotron-Cascade-2-30B-A3B — NVIDIA Open Model License
Tags:
- AI
- NVIDIA
- Reasoning
- MoE
- Open Source
- Mathematics