MiMo-V2.5-Pro: 1.02T parameters, MIT-licensed agent powerhouse
- Bastien
- 01 May, 2026
From V2-Pro to V2.5-Pro: the long-context breakthrough
XiaoMi’s MiMo family has rapidly positioned itself among the top open-weight models. MiMo-V2.5-Pro is the latest iteration — a 1.02 trillion-parameter Mixture of Experts model with 42B active parameters that significantly extends what was possible in MiMo-V2-Pro. The headline capability is not just scale, but long-context persistence at 1M tokens combined with agentic reasoning trained via multi-teacher on-policy distillation.
Where MiMo-V2-Pro proved the architecture could handle long sequences, V2.5-Pro proves it can retain and apply information across a full million tokens while maintaining agentic reasoning. On long-context graph navigation benchmarks, V2.5-Pro scores 0.62 on the Parents task at 1M tokens (vs. V2-Pro collapsing to 0 at that length), a qualitative leap in context retention.
Architecture: MoE with Hybrid Attention and Multi-Token Prediction
MiMo-V2.5-Pro uses a 70-layer architecture (1 dense + 69 MoE) with a carefully structured attention system and novel prediction heads.
Hybrid Attention (SWA + GA, 6:1 ratio).
60 layers use Sliding Window Attention (SWA) with a 128-token window, handling local context efficiently. The remaining 10 layers use Global Attention (GA) to capture long-range dependencies. The 6:1 ratio was found optimal during pre-training — enough global access to prevent information loss across 1M tokens, while the sliding window keeps per-layer compute manageable.
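As a sketch of how such a 6:1 schedule and sliding-window mask might look — the layer placement and mask logic below are illustrative assumptions, not XiaoMi's released implementation:

```python
WINDOW = 128  # sliding-window size in tokens

def layer_kinds(n_layers=70, ratio=6):
    """Mark every (ratio+1)-th layer as global attention (GA), the rest as SWA."""
    return ["GA" if i % (ratio + 1) == ratio else "SWA" for i in range(n_layers)]

def attention_mask(seq_len, kind, window=WINDOW):
    """Causal mask: entry [q][k] is True if query q may attend to key k."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        # GA layers see the full causal prefix; SWA layers only a local window.
        lo = 0 if kind == "GA" else max(0, q - window + 1)
        for k in range(lo, q + 1):
            mask[q][k] = True
    return mask

kinds = layer_kinds()
print(kinds.count("SWA"), kinds.count("GA"))  # → 60 10
```

With 70 layers and a 6:1 ratio, this schedule reproduces the 60 sliding-window and 10 global layers described above; per-layer attention cost for the SWA layers stays O(seq_len · window) instead of quadratic.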
Mixture of Experts (384 experts, 8 routed per token).
The model has 384 total experts with 8 active per token, yielding 42B active parameters out of 1.02T total. The routing network distributes tokens across experts dynamically, with each expert specialising in different linguistic or reasoning patterns discovered during the 27T-token pre-training phase.
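A minimal top-8-of-384 router along these lines, assuming standard softmax gating over raw expert logits (the released routing function may differ, e.g. with bias terms or load-balancing losses):

```python
# Sketch of top-k expert routing; gating scheme is an assumption for illustration.
import math
import random

N_EXPERTS, TOP_K = 384, 8

def route(expert_logits, k=TOP_K):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    top = sorted(range(len(expert_logits)),
                 key=lambda e: expert_logits[e], reverse=True)[:k]
    exps = [math.exp(expert_logits[e]) for e in top]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(N_EXPERTS)]
chosen = route(logits)  # 8 (expert_id, gate_weight) pairs, weights summing to 1
print(len(chosen))  # → 8
```

The token's output is then the gate-weighted sum of the 8 chosen experts' outputs, which is how only 42B of the 1.02T parameters are exercised per token.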
Multi-Token Prediction (3 layers).
Three MTP heads are attached to intermediate layers, allowing the model to predict the next 3 tokens simultaneously during training. This accelerates convergence and improves the quality of the learned representations without adding inference overhead, as MTP is only used during pre-training.
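The target construction for three such heads can be sketched as follows; the offset scheme (head d predicts the token d positions ahead) is an assumption for illustration:

```python
# Illustrative MTP target layout: head d is trained on the token d steps ahead.

def mtp_targets(tokens, n_heads=3):
    """For each position t, the n_heads future-token targets (None past the end)."""
    return [[tokens[t + d] if t + d < len(tokens) else None
             for d in range(1, n_heads + 1)]
            for t in range(len(tokens))]

seq = [10, 11, 12, 13, 14]
print(mtp_targets(seq)[0])  # → [11, 12, 13]
```

Each training position thus supplies three supervision signals instead of one, which is the mechanism behind the faster convergence claimed above; at inference the extra heads are simply dropped.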
FP8 mixed-precision training.
MiMo-V2.5-Pro was trained in FP8 mixed precision, a first at this model scale. FP8 reduces memory bandwidth requirements and increases throughput during training without noticeable accuracy loss — the 27T-token pre-training phase ran at 32K sequence length using FP8, with a switch to BF16 for the final post-training stages.
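To build intuition for what FP8 rounding does to individual values, here is a simplified fake-quantiser for the E4M3 format commonly used for FP8 weights. It saturates out-of-range values and ignores subnormals and NaN encodings; it is an illustration, not the training code:

```python
import math

E4M3_MAX = 448.0  # largest normal value in the common FP8 E4M3 format

def fake_quant_e4m3(x):
    """Round x to the nearest representable E4M3 value (simplified:
    no subnormals, no NaN; out-of-range magnitudes saturate to 448)."""
    if x == 0.0:
        return 0.0
    s = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    e = max(math.floor(math.log2(a)), -6)  # 4 exponent bits, clamped for tiny inputs
    m = round(a / 2.0 ** e * 8) / 8        # 3 mantissa bits
    return s * m * 2.0 ** e

print(fake_quant_e4m3(0.30))  # → 0.3125
```

The coarse 3-bit mantissa is why FP8 is typically used with per-tensor or per-block scaling during training, trading precision for roughly half the memory traffic of BF16.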
Benchmark results
Base model evaluation (post pre-training, before fine-tuning)
| Benchmark | MiMo-V2.5-Pro Base | MiMo-V2-Pro Base |
|---|---|---|
| BBH | 88.4 | 85.1 |
| MMLU | 89.4 | 87.3 |
| MMLU-Pro | 68.5 | 64.2 |
| MATH | 86.2 | 83.8 |
| HumanEval+ | 75.6 | 71.3 |
| SWE-Bench (AgentLess) | 35.7 | 31.2 |
| C-Eval | 91.5 | 89.8 |
The base model already holds a consistent lead over V2-Pro across every category, reflecting the larger 27T-token pre-training corpus and the throughput gains from FP8 training.
Post-training evaluation (SFT + agentic RL + MOPD)
| Benchmark | MiMo-V2.5-Pro | MiMo-V2-Pro |
|---|---|---|
| SWE-Bench Pro | 57.2% | — |
| SWE-Bench Verified | 78.9% | 74.1% |
| TerminalBench 2 | 68.4% | 59.3% |
| GPQA-Diamond | 66.7% | 63.5% |
| GSM8K | 99.6% | 98.7% |
The post-training gains are substantial, particularly on SWE-Bench Verified (+4.8 points) and TerminalBench 2 (+9.1 points); SWE-Bench Pro (57.2%) has no V2-Pro baseline to compare against. Multi-Teacher On-Policy Distillation (MOPD) plays a key role here: multiple teacher models guide the student through on-policy trajectories, teaching it to reason more systematically in agentic scenarios.
Long-context benchmark
V2.5-Pro’s defining achievement is preserving information at extreme context lengths.
| Benchmark | MiMo-V2.5-Pro (1M) | MiMo-V2.5 (1M) | MiMo-V2-Pro (collapse) |
|---|---|---|---|
| GraphWalks BFS | 0.37 | 0.31 | 0.00 |
| GraphWalks Parents | 0.62 | 0.48 | 0.00 |
At 1M tokens, V2-Pro completely collapses (0.00 on both tasks), while V2.5 retains meaningful ability and V2.5-Pro achieves the strongest scores. This is not a marginal improvement — it is the difference between a model that can reason across a 2000-page document and one that cannot.
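A toy version of such a parents-retrieval probe, scored by set overlap (F1), can make the task concrete; the real GraphWalks task format and scoring are assumptions based on the description above:

```python
# Toy parents-retrieval probe in the spirit of GraphWalks: hide an edge list in
# a long context and score an answer by set overlap. Illustrative only.
import random

def make_task(n_nodes=50, n_edges=120, seed=1):
    """Random directed edge list plus the gold parent set of one target node."""
    rng = random.Random(seed)
    edges = [(rng.randrange(n_nodes), rng.randrange(n_nodes))
             for _ in range(n_edges)]
    target = rng.randrange(n_nodes)
    parents = {p for p, c in edges if c == target}
    return edges, target, parents

def f1(pred, gold):
    """Set-overlap F1 between a predicted and gold parent set."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

edges, target, gold = make_task()
print(f1(gold, gold))  # a perfect answer scores 1.0
```

The difficulty at 1M tokens is that the relevant edges are scattered across the whole context, so any layer-local attention scheme must still propagate them to the answer position.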
What makes it different: long-context persistence + agentic capability
Two capabilities set MiMo-V2.5-Pro apart from competitors:
1. Long-context that actually works.
Hybrid Attention with the 6:1 SWA/GA ratio and MTP training together let information travel across 1M tokens. Most models lose context coherence well before 256K. V2.5-Pro was explicitly evaluated and validated on graph navigation tasks spanning up to a million tokens, a setting that requires remembering relationships across the entire sequence.
2. Agentic reasoning via MOPD.
Post-training with Multi-Teacher On-Policy Distillation gives V2.5-Pro the ability to plan, execute multi-step operations (code, search, browse), and recover from errors autonomously. The agentic RL loop taught the model to maintain strategy across hundreds of action steps, which directly translates to SWE-Bench Pro scores of 57.2%.
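At the level of a single sampled token, one way such a multi-teacher objective could look is the following; the weighted teacher mixture and the KL direction are assumptions, and the actual MOPD formulation may differ:

```python
# Single-token sketch of a multi-teacher, on-policy distillation loss: the
# student samples its own trajectory and is pulled toward a weighted mixture
# of teacher next-token distributions. Illustrative assumptions throughout.
import math

def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mopd_loss(student, teachers, weights):
    """KL(teacher mixture || student) at one sampled token position."""
    mix = [sum(w * t[i] for w, t in zip(weights, teachers))
           for i in range(len(student))]
    return kl(mix, student)

s = [0.7, 0.2, 0.1]                          # student next-token distribution
t1, t2 = [0.6, 0.3, 0.1], [0.9, 0.05, 0.05]  # two teacher distributions
print(mopd_loss(s, [s, s], [0.5, 0.5]))      # matching teachers → 0.0 loss
```

The on-policy part matters: because the trajectories are sampled from the student itself, the teachers correct the states the student actually visits, rather than states from a teacher-generated dataset.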
Comparison with MiMo-V2.5 and MiMo-V2-Pro
| Dimension | MiMo-V2-Pro | MiMo-V2.5 | MiMo-V2.5-Pro |
|---|---|---|---|
| Total params | 1.02T | 1.02T | 1.02T |
| Active params | 42B | 42B | 42B |
| Architecture | MoE + Hybrid Attn | MoE + Hybrid Attn | MoE + Hybrid Attn + MTP |
| Layers | 69 | 69 | 70 (1 dense + 69 MoE) |
| Experts | 384 / 8 | 384 / 8 | 384 / 8 |
| Context | 128K | 512K | 1M |
| Pre-training tokens | 14T | 20T | 27T |
| Pre-training precision | FP16 | BF16 | FP8 → BF16 |
| Post-training | SFT | SFT + RL | SFT + Agentic RL + MOPD |
| SWE-Bench Pro | — | — | 57.2% |
| GraphWalks Parents @ 1M | 0.00 | 0.00 | 0.62 |
| License | MIT | MIT | MIT |
V2.5-Pro is the only model in the family with MTP training, the full 1M context validation, and the MOPD post-training pipeline.
Deployment
MiMo-V2.5-Pro can be deployed via two primary inference frameworks:
- SGLang (latest) with EAGLE speculative decoding for accelerated inference
- vLLM (latest) for high-throughput serving
An API is available at platform.xiaomimimo.com.
For local deployment with EAGLE (speculative decoding), SGLang provides significant speedups by using a smaller draft model to propose tokens, which the V2.5-Pro verifier accepts or rejects — typically achieving 2-3× throughput improvement without accuracy loss.
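The accept/reject loop behind speculative decoding can be sketched as follows. This toy version uses greedy stand-in models and verifies proposals one token at a time, whereas a real system (including EAGLE, which additionally reuses target hidden states) scores the whole draft in a single batched target pass:

```python
# Toy greedy speculative-decoding step: a cheap draft proposes k tokens,
# the target keeps the longest agreeing prefix plus its own correction.
# draft_next/target_next are stand-in functions, not real models.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Return the tokens accepted this step (always at least one)."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:        # target agrees: keep the draft token
            accepted.append(t)
            ctx.append(t)
        else:                            # first disagreement: take target's token
            accepted.append(target_next(ctx))
            break
    return accepted

target = lambda ctx: len(ctx) % 10                              # stand-in target
draft = lambda ctx: len(ctx) % 10 if len(ctx) % 4 != 3 else 99  # imperfect draft
print(speculative_step([1, 2], draft, target))  # → [2, 3]
```

Greedy verification like this never changes the target model's output, which is why the speedup comes without accuracy loss: the draft only decides how many target tokens are confirmed per verification pass.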
License and access
MiMo-V2.5-Pro is released under the MIT license — full commercial use, modification, and redistribution allowed with minimal restrictions.
The model is available at: XiaoMi/MiMo-V2.5-Pro
Citation
```bibtex
@misc{xiaomicom2026.mimov25pro,
  title={MiMo V2.5 Pro: Efficient Long-Context Language Model with Multi-Teacher On-Policy Distillation},
  author={XiaoMi},
  year={2026}
}
```
Tags:
- AI
- MiMo
- LLM
- MoE
- Agentic
- Open Source
- Long Context
- Coding
- FP8