>_Reeboot
Nemotron Cascade 2: NVIDIA's 30B model that won the math and coding Olympics
AI

Nemotron Cascade 2: NVIDIA's 30B model that won the math and coding Olympics

NVIDIA's Nemotron Cascade 2 is a 30B MoE model with only 3B activated parameters — and it just won gold medals at the 2025 International Mathematical Olympiad and International Olympiad in Informatics

What is Nemotron Cascade 2?

Nemotron Cascade 2 (30B-A3B) is an open model released by NVIDIA on March 19, 2026. Its headline number is deceptive: 30 billion total parameters, but only 3 billion activated per inference pass. This is the Mixture of Experts architecture at work — the model routes each token through a small subset of its capacity, making it dramatically more efficient than a dense 30B model.

It supports two modes: thinking (extended chain-of-thought for hard problems) and instruct (fast, direct responses). On hard reasoning tasks, the thinking mode delivers results that are difficult to believe from a sub-frontier model.


Architecture and training

The training pipeline combines two techniques:

  • Cascade RL — a reinforcement learning approach that progressively challenges the model with harder problems as it improves
  • Multi-Domain On-Policy Distillation — the model generates its own training data under RL supervision, across mathematics, code, science, and instruction-following

The result is a model that has genuinely internalized structured problem-solving, not just pattern-matched against training examples.


Gold medals

This is the headline achievement. At the 2025 International Mathematical Olympiad and International Olympiad in Informatics, Nemotron Cascade 2 scored at gold medal level — competing against the best human students in the world.

These aren't just benchmark numbers — IMO and IOI are the hardest math and programming competitions in the world, held annually with thousands of participants. A 30B open model reaching gold medal level is a meaningful milestone.


Full benchmark results

Mathematics

Benchmark Score
IMO 2025 35 pts (gold)
AIME 2025 92.4 (98.6 with TIR)
AIME 2026 90.9 (95.0 with TIR)
HMMT Feb 2025 94.6
IMO AnswerBench 79.3

Code & competitive programming

Benchmark Score
IOI 2025 439.3 pts (gold)
ICPC World Finals 2025 10/12
LiveCodeBench v6 87.2 (88.4 with TIR)
SWE Verified (OpenHands) 50.2

Knowledge & science

Benchmark Score
GPQA-Diamond 76.1
MMLU-Pro 79.8
MMLU-Redux 86.3

Instruction following & alignment

Benchmark Score
ArenaHard v2 (avg.) 83.5
ArenaHard hard prompts 88.2
IFBench 82.9

Context length

Benchmark Score
NIAH @ 1M tokens 99.0
LongBench v2 40.3

The NIAH (Needle In A Haystack) score of 99.0 at 1 million tokens is particularly notable — the model reliably finds information buried in a 1M-token context.


Efficiency: 3B activated out of 30B

The MoE architecture is the key to making this model practical. At inference time, only 3B parameters fire per token. This means:

Metric Value
Total parameters 30B
Activated per token 3B (10%)
Context window 262,144 tokens
Tensor type BF16 / F32
Minimum setup Single high-end GPU

You can serve this with vllm on a single GPU with --tensor-parallel-size 1 — no multi-GPU setup required for standard use.


Dual mode operation

The model is controlled via the chat template rather than separate model weights.

Thinking mode — activates the <think> reasoning trace before answering:

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True   # → <think>\n...
)

Instruct mode — skips the reasoning trace for fast responses:

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # → <think></think>
)

Recommended sampling: temperature=1.0, top_p=0.95.


Agentic and tool use

The model natively supports Tool-Integrated Reasoning (TIR) — it can call Python code execution mid-reasoning and incorporate the result before producing its final answer. This is what drives the +TIR improvements in the benchmark scores above.

Tool calls use this format:

<tool_call>
<function=stateful_python_code_exec>
<parameter=code>import sympy; sympy.solve(...)</parameter>
</function>
</tool_call>

For agentic coding, the model integrates with OpenHands (50.2 on SWE Verified). OpenCode is not currently supported.


Use cases

Best for:

  • Competitive mathematics and formal proofs
  • Hard coding problems (competitive programming level)
  • Long-context document analysis (up to 262K tokens)
  • Agentic coding workflows via OpenHands
  • Scientific reasoning (GPQA-Diamond: 76.1)

Not recommended for:

  • Real-time fact retrieval (no web access)
  • Deployments requiring OpenCode integration
  • Memory-constrained environments without GPU

Model ecosystem


Limitations

  • No OpenCode support — only OpenHands for agentic coding tasks
  • Context compression in multi-turn thinking — only the summary (not the full <think> trace) is retained in conversation history
  • Tool response format is non-standard — tool results go under the user role wrapped in <tool_response> tags, not a separate tool role
  • License is NVIDIA Open Model License, not Apache 2.0 — check terms for commercial use

Conclusion

Nemotron Cascade 2 redraws what's possible with an efficient open model. A 3B-activated MoE winning gold at IMO and IOI is a genuine inflection point — not a benchmark cherry-pick, but a performance on the hardest public competitions that exist for mathematics and programming.

For researchers, engineers, and anyone building reasoning-heavy applications locally, this is the most capable open model in its weight class as of early 2026.

Model: nvidia/Nemotron-Cascade-2-30B-A3B — NVIDIA Open Model License