Nemotron Cascade 2: NVIDIA's 30B model that won the math and coding Olympics
- Bastien
- 24 Mar, 2026
What is Nemotron Cascade 2?
Nemotron Cascade 2 (30B-A3B) is an open model released by NVIDIA on March 19, 2026. Its headline number is deceptive: 30 billion total parameters, but only 3 billion activated per inference pass. This is a Mixture-of-Experts (MoE) architecture at work: the model routes each token through a small subset of its capacity, making it dramatically more efficient than a dense 30B model.
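To see what "routes each token through a small subset of its capacity" means, here is a minimal top-k gating sketch in plain Python. The expert count and k are invented for illustration; this shows the general MoE routing idea, not NVIDIA's actual router.

```python
import math
import random

def top_k_route(gate_logits, k=2):
    """Pick the k experts with the highest gate scores for one token."""
    # softmax over the gate logits -> routing probabilities
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # only the top-k experts run for this token; the rest stay idle
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs

random.seed(0)
gate_logits = [random.gauss(0, 1) for _ in range(8)]  # 8 experts, illustrative
active, probs = top_k_route(gate_logits, k=2)
print(f"experts {active} handle this token ({len(active)}/{len(probs)} active)")
```

Only the selected experts' weights participate in the forward pass for that token, which is why activated parameters, not total parameters, set the per-token compute cost.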
It supports two modes: thinking (extended chain-of-thought for hard problems) and instruct (fast, direct responses). On hard reasoning tasks, the thinking mode delivers results that are difficult to believe from a sub-frontier model.
Architecture and training
The training pipeline combines two techniques:
- Cascade RL — a reinforcement learning approach that progressively challenges the model with harder problems as it improves
- Multi-Domain On-Policy Distillation — the model generates its own training data under RL supervision, across mathematics, code, science, and instruction-following
The result is a model that has genuinely internalized structured problem-solving, not just pattern-matched against training examples.
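NVIDIA has not published the exact training loop, but the progressive-difficulty idea behind Cascade RL can be sketched as a toy curriculum: promote the model to harder problems once its success rate clears a threshold. Everything below (batch size, thresholds, the scalar "skill" standing in for the policy improving under RL) is invented for illustration.

```python
import random

def train_cascade(n_rounds=200, promote_at=0.8, seed=0):
    """Toy curriculum loop: raise problem difficulty as the model improves."""
    rng = random.Random(seed)
    difficulty, skill = 1, 1.0
    for _ in range(n_rounds):
        # sample a batch at the current difficulty and measure success;
        # a problem is solved with probability that grows with skill
        wins = sum(rng.random() < min(1.0, skill / difficulty) for _ in range(32))
        rate = wins / 32
        skill += 0.05 * rate      # reward signal: improve in proportion to success
        if rate >= promote_at:    # curriculum step: move to harder problems
            difficulty += 1
    return difficulty, skill

difficulty, skill = train_cascade()
print(f"reached difficulty {difficulty} with skill {skill:.2f}")
```

The key property is the feedback loop: the problem distribution is a function of current performance, so the model spends most of its training budget near the edge of its ability.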
Gold medals
This is the headline achievement. At the 2025 International Mathematical Olympiad and International Olympiad in Informatics, Nemotron Cascade 2 scored at gold medal level — competing against the best human students in the world.
These aren’t just benchmark numbers — IMO and IOI are the most prestigious mathematics and programming competitions there are, each held annually for the strongest pre-university competitors from around the world. A 30B open model reaching gold medal level is a meaningful milestone.
Full benchmark results
Mathematics
| Benchmark | Score |
|---|---|
| IMO 2025 | 35 pts (gold) |
| AIME 2025 | 92.4 (98.6 with TIR) |
| AIME 2026 | 90.9 (95.0 with TIR) |
| HMMT Feb 2025 | 94.6 |
| IMO AnswerBench | 79.3 |
Code & competitive programming
| Benchmark | Score |
|---|---|
| IOI 2025 | 439.3 pts (gold) |
| ICPC World Finals 2025 | 10/12 problems solved |
| LiveCodeBench v6 | 87.2 (88.4 with TIR) |
| SWE-bench Verified (OpenHands) | 50.2 |
Knowledge & science
| Benchmark | Score |
|---|---|
| GPQA-Diamond | 76.1 |
| MMLU-Pro | 79.8 |
| MMLU-Redux | 86.3 |
Instruction following & alignment
| Benchmark | Score |
|---|---|
| ArenaHard v2 (avg.) | 83.5 |
| ArenaHard hard prompts | 88.2 |
| IFBench | 82.9 |
Context length
| Benchmark | Score |
|---|---|
| NIAH @ 1M tokens | 99.0 |
| LongBench v2 | 40.3 |
The NIAH (Needle In A Haystack) score of 99.0 at 1 million tokens is particularly notable — the model reliably finds information buried in a 1M-token context.
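The NIAH evaluation itself is easy to picture: plant one distinctive sentence somewhere in a long span of filler text, then ask the model to retrieve it. A minimal prompt builder, with the filler and the needle invented for the example:

```python
import random

def build_haystack(needle, n_filler=1000, seed=42):
    """Bury one 'needle' sentence at a random position in filler text,
    then ask the model to retrieve it -- the essence of the NIAH test."""
    rng = random.Random(seed)
    filler = ["The sky is blue and the grass is green."] * n_filler
    pos = rng.randrange(n_filler)
    filler.insert(pos, needle)
    context = " ".join(filler)
    question = "What is the secret passphrase mentioned in the text?"
    return f"{context}\n\nQuestion: {question}", pos

needle = "The secret passphrase is 'moonlight-cascade'."
prompt, pos = build_haystack(needle)
print(f"needle planted at filler position {pos} of 1000")
```

A 99.0 score means the model almost always answers correctly regardless of where in the 1M-token context the needle lands.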
Efficiency: 3B activated out of 30B
The MoE architecture is the key to making this model practical. At inference time, only 3B parameters fire per token. This means:
| Metric | Value |
|---|---|
| Total parameters | 30B |
| Activated per token | 3B (10%) |
| Context window | 262,144 tokens |
| Tensor type | BF16 / F32 |
| Minimum setup | Single high-end GPU |
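A quick back-of-envelope shows why the split matters: weight memory is driven by total parameters (every expert must be resident), while per-token compute scales with activated parameters. The 2-FLOPs-per-active-parameter-per-token figure below is a standard rule of thumb, not an NVIDIA number.

```python
TOTAL_PARAMS = 30e9   # all experts; every one must be resident in memory
ACTIVE_PARAMS = 3e9   # routed subset that actually computes per token
BF16_BYTES = 2

# Weight memory scales with TOTAL parameters...
weight_gb = TOTAL_PARAMS * BF16_BYTES / 1e9
# ...but per-token compute scales with ACTIVATED parameters
# (~2 FLOPs per active parameter per token, a standard estimate)
flops_per_token = 2 * ACTIVE_PARAMS

print(f"{weight_gb:.0f} GB of BF16 weights, "
      f"{flops_per_token / 1e9:.0f} GFLOPs per token, "
      f"{ACTIVE_PARAMS / TOTAL_PARAMS:.0%} of parameters active")
```

So the ~60 GB of BF16 weights fits on a single 80 GB-class GPU, while each token costs roughly what a dense 3B model would.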
You can serve this with vLLM on a single GPU using `--tensor-parallel-size 1`; no multi-GPU setup is required for standard use.
Dual mode operation
The model is controlled via the chat template rather than separate model weights.
Thinking mode — activates the <think> reasoning trace before answering:
```python
from transformers import AutoTokenizer

# tokenizer and messages are set up once and reused for both modes
tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Cascade-2-30B-A3B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # prompt ends with an open <think> block
)
```
Instruct mode — skips the reasoning trace for fast responses:
```python
# same tokenizer and messages as above
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # inserts an empty <think></think> block
)
```
Recommended sampling: `temperature=1.0`, `top_p=0.95`.
Agentic and tool use
The model natively supports Tool-Integrated Reasoning (TIR) — it can call Python code execution mid-reasoning and incorporate the result before producing its final answer. This is what drives the +TIR improvements in the benchmark scores above.
Tool calls use this format:
```
<tool_call>
<function=stateful_python_code_exec>
<parameter=code>import sympy; sympy.solve(...)</parameter>
</function>
</tool_call>
```
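On the consuming side, a harness has to pull the code out of that wrapper before executing it. A regex-based sketch, under the assumption that the format is exactly as shown above (the official parser may differ):

```python
import re

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*"
    r"<function=(?P<name>[\w.]+)>\s*"
    r"<parameter=code>(?P<code>.*?)</parameter>\s*"
    r"</function>\s*"
    r"</tool_call>",
    re.DOTALL,
)

def extract_tool_calls(text):
    """Pull (function_name, code) pairs out of a model response."""
    return [(m.group("name"), m.group("code")) for m in TOOL_CALL_RE.finditer(text)]

response = (
    "Let me verify with sympy.\n"
    "<tool_call>\n<function=stateful_python_code_exec>\n"
    "<parameter=code>import sympy; print(sympy.isprime(97))</parameter>\n"
    "</function>\n</tool_call>"
)
print(extract_tool_calls(response))
```

The harness would then run the extracted code in a persistent interpreter and feed the output back to the model, which resumes its reasoning with the result in hand.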
For agentic coding, the model integrates with OpenHands (50.2 on SWE-bench Verified). OpenCode is not currently supported.
Use cases
Best for:
- Competitive mathematics and formal proofs
- Hard coding problems (competitive programming level)
- Long-context document analysis (up to 262K tokens)
- Agentic coding workflows via OpenHands
- Scientific reasoning (GPQA-Diamond: 76.1)
Not recommended for:
- Real-time fact retrieval (no web access)
- Deployments requiring OpenCode integration
- Memory-constrained environments without GPU
Limitations
- No OpenCode support — only OpenHands for agentic coding tasks
- Context compression in multi-turn thinking — only the summary, not the full `<think>` trace, is retained in conversation history
- Tool response format is non-standard — tool results go under the `user` role wrapped in `<tool_response>` tags, not a separate `tool` role
- License is the NVIDIA Open Model License, not Apache 2.0 — check terms for commercial use
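The non-standard tool-response convention from the list above can be made concrete as a message list. The conversation content here is invented; the point is the role layout: the tool result rides in a `user` turn inside `<tool_response>` tags rather than in a dedicated `tool` role.

```python
# Illustrative conversation: note that the tool output comes back under
# the "user" role, wrapped in <tool_response> tags -- no "tool" role exists.
messages = [
    {"role": "user", "content": "Is 97 prime? Use Python to check."},
    {"role": "assistant", "content": (
        "<tool_call>\n<function=stateful_python_code_exec>\n"
        "<parameter=code>import sympy; print(sympy.isprime(97))</parameter>\n"
        "</function>\n</tool_call>"
    )},
    # Tool result, fed back as a user turn:
    {"role": "user", "content": "<tool_response>\nTrue\n</tool_response>"},
]
roles = [m["role"] for m in messages]
print(roles)
```

Any harness built on a standard OpenAI-style `tool` role will need this small adaptation before its transcripts match what the chat template expects.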
Conclusion
Nemotron Cascade 2 redraws what’s possible with an efficient open model. A 3B-activated MoE winning gold at IMO and IOI is a genuine inflection point — not a benchmark cherry-pick, but a performance on the hardest public competitions that exist for mathematics and programming.
For researchers, engineers, and anyone building reasoning-heavy applications locally, this is the most capable open model in its weight class as of early 2026.
Model: nvidia/Nemotron-Cascade-2-30B-A3B — NVIDIA Open Model License
Tags:
- AI
- NVIDIA
- Reasoning
- MoE
- Open Source
- Mathematics