MiMo-V2.5-Pro: 1.02T parameters, MIT-licensed agent powerhouse

From V2-Pro to V2.5-Pro: the long-context breakthrough

XiaoMi’s MiMo family has rapidly positioned itself among the top open-weight models. MiMo-V2.5-Pro is the latest iteration — a 1.02 trillion-parameter Mixture of Experts model with 42B active parameters that significantly extends what was possible in MiMo-V2-Pro. The headline capability is not just scale, but long-context persistence at 1M tokens combined with agentic reasoning trained via multi-teacher on-policy distillation.

Where MiMo-V2-Pro proved the architecture could handle long sequences, V2.5-Pro proves it can retain and apply information across a full million tokens while maintaining agentic reasoning. On long-context graph navigation benchmarks, V2.5-Pro scores 0.62 on the Parents task at 1M tokens (vs. V2-Pro collapsing to 0 at that length), a qualitative leap in context retention.


Architecture: MoE with Hybrid Attention and Multi-Token Prediction

MiMo-V2.5-Pro uses a 70-layer architecture (1 dense + 69 MoE) with a carefully structured attention system and novel prediction heads.

Hybrid Attention (SWA + GA, 6:1 ratio).
60 layers use Sliding Window Attention (SWA) with a 128-token window, handling local context efficiently. The remaining 10 layers use Global Attention (GA) to capture long-range dependencies. The 6:1 ratio was found optimal during pre-training — enough global access to prevent information loss across 1M tokens, while the sliding window keeps per-layer compute manageable.
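To make the ratio concrete, here is a minimal sketch of the layer schedule and the sliding-window causal mask used by the SWA layers. The release only specifies the overall 6:1 ratio and the 128-token window; the exact placement of the 10 global layers (every seventh layer here) is an assumption for illustration.

```python
import torch

NUM_LAYERS = 70     # 1 dense + 69 MoE blocks, per the article
SWA_WINDOW = 128    # lookback window for the sliding-window layers
GA_EVERY = 7        # hypothetical placement: every 7th layer is global (6:1 ratio)

def layer_attention_schedule(num_layers: int = NUM_LAYERS) -> list:
    """Label each layer 'sliding' or 'global' in a 6:1 pattern."""
    return ["global" if (i + 1) % GA_EVERY == 0 else "sliding"
            for i in range(num_layers)]

def sliding_window_causal_mask(seq_len: int, window: int = SWA_WINDOW) -> torch.Tensor:
    """Boolean mask (True = attendable): token i sees tokens j with
    i - window < j <= i, i.e. causal attention bounded to the last `window` tokens."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]
    local = (idx[:, None] - idx[None, :]) < window
    return causal & local

schedule = layer_attention_schedule()
print(schedule.count("sliding"), "SWA layers /", schedule.count("global"), "GA layers")  # 60 / 10
print(sliding_window_causal_mask(6, window=3).int())
```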

Mixture of Experts (384 experts, 8 routed per token).
The model has 384 total experts with 8 active per token, yielding 42B active parameters out of 1.02T total. The routing network distributes tokens across experts dynamically, with each expert specialising in different linguistic or reasoning patterns discovered during the 27T-token pre-training phase.
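A minimal routing sketch in PyTorch, assuming standard token-choice top-k gating; the auxiliary load-balancing terms, capacity limits, and the exact softmax/top-k ordering are assumptions rather than confirmed details of the release.

```python
import torch
import torch.nn.functional as F

NUM_EXPERTS = 384   # experts per MoE layer (per the article)
TOP_K = 8           # experts activated per token

class TopKRouter(torch.nn.Module):
    """Minimal token-choice router: score every expert, keep the top 8, and
    renormalise their gate weights. Load balancing and expert parallelism omitted."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, NUM_EXPERTS, bias=False)

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                                 # (tokens, 384)
        top_logits, expert_ids = logits.topk(TOP_K, dim=-1)   # (tokens, 8)
        weights = F.softmax(top_logits, dim=-1)               # renormalise over the 8 winners
        return expert_ids, weights

router = TopKRouter(hidden_size=1024)
ids, w = router(torch.randn(4, 1024))      # 4 tokens
print(ids.shape, w.sum(dim=-1))            # torch.Size([4, 8]), weights sum to 1 per token
```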

Multi-Token Prediction (3 layers).
Three MTP heads are attached to intermediate layers, allowing the model to predict the next 3 tokens simultaneously during training. This accelerates convergence and improves the quality of the learned representations without adding inference overhead, as MTP is only used during pre-training.
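The training objective can be sketched as follows, assuming the extra heads predict offsets 1 to 3 ahead from shared hidden states. The real MTP modules sit on intermediate layers and are discarded at inference; this flattened version only illustrates the loss.

```python
import torch
import torch.nn.functional as F

class MTPHeads(torch.nn.Module):
    """Three extra heads that each predict the token k steps ahead from the
    same hidden states; the losses are averaged into one auxiliary objective."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden)   tokens: (batch, seq)
        loss = 0.0
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])      # positions that still have a target k ahead
            targets = tokens[:, k:]            # the token k steps in the future
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return loss / len(self.heads)

mtp = MTPHeads(hidden_size=64, vocab_size=1000)
print(mtp(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 16))))
```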

FP8 mixed-precision training.
MiMo-V2.5-Pro was trained in FP8 mixed precision, a first at this model scale. FP8 reduces memory bandwidth requirements and increases throughput during training without noticeable accuracy loss — the 27T-token pre-training phase ran at 32K sequence length using FP8, with a switch to BF16 for the final post-training stages.
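The core arithmetic of FP8 per-tensor scaling is easy to illustrate. The sketch below assumes a recent PyTorch build with float8 dtypes; production stacks (for example Transformer Engine) add delayed scaling, finer-grained scaling blocks, and FP8 matmul kernels, none of which are shown here.

```python
import torch

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

def fp8_quant_dequant(x: torch.Tensor):
    """Per-tensor FP8 scaling: choose a scale so the tensor's absolute max maps
    to the E4M3 limit, cast down for storage/matmul, keep the scale for dequant."""
    scale = E4M3_MAX / x.abs().max().clamp(min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)     # 1 byte per value
    x_back = x_fp8.to(torch.float32) / scale        # dequantised approximation
    return x_fp8, scale, x_back

w = torch.randn(4096, 4096)
w8, scale, w_approx = fp8_quant_dequant(w)
print("relative error:", ((w - w_approx).abs().mean() / w.abs().mean()).item())
```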


Benchmark results

Base model evaluation (post pre-training, before fine-tuning)

| Benchmark | MiMo-V2.5-Pro Base | MiMo-V2-Pro Base |
|---|---|---|
| BBH | 88.4 | 85.1 |
| MMLU | 89.4 | 87.3 |
| MMLU-Pro | 68.5 | 64.2 |
| MATH | 86.2 | 83.8 |
| HumanEval+ | 75.6 | 71.3 |
| SWE-Bench (AgentLess) | 35.7 | 31.2 |
| C-Eval | 91.5 | 89.8 |

The base model already shows a consistent lead over V2-Pro across every category, reflecting the larger 27T-token pre-training corpus and the efficiency gains from FP8 training.

Post-training evaluation (SFT + agentic RL + MOPD)

| Benchmark | MiMo-V2.5-Pro | MiMo-V2-Pro |
|---|---|---|
| SWE-Bench Pro | 57.2% | n/a |
| SWE-Bench Verified | 78.9% | 74.1% |
| TerminalBench 2 | 68.4% | 59.3% |
| GPQA-Diamond | 66.7% | 63.5% |
| GSM8K | 99.6% | 98.7% |

The post-training gains are substantial, particularly on SWE-Bench Verified (+4.8 points) and TerminalBench 2 (+9.1 points), alongside 57.2% on the newer SWE-Bench Pro suite. Multi-Teacher On-Policy Distillation (MOPD) plays a key role here: multiple teacher models guide the student through on-policy trajectories, teaching it to reason more systematically in agentic scenarios.


Long-context benchmark

V2.5-Pro’s defining achievement is preserving information at extreme context lengths.

| Benchmark | MiMo-V2.5-Pro (1M) | MiMo-V2.5 (1M) | MiMo-V2-Pro (collapse) |
|---|---|---|---|
| GraphWalks BFS | 0.37 | 0.31 | 0.00 |
| GraphWalks Parents | 0.62 | 0.48 | 0.00 |

At 1M tokens, V2-Pro completely collapses (0.00 on both tasks), while V2.5 retains meaningful ability and V2.5-Pro achieves the strongest scores. This is not a marginal improvement — it is the difference between a model that can reason across a 2000-page document and one that cannot.
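To see what a GraphWalks-style "Parents" item demands of a model, here is a toy construction and scoring sketch. The real benchmark's prompt format, graph sizes, and scoring details are assumptions; only the task shape (find every direct parent of a target node hidden somewhere in a huge edge list) follows the article.

```python
import random

def build_parents_item(num_nodes: int = 2000, num_edges: int = 6000, seed: int = 0):
    """Toy 'Parents' item: a long edge list plus one question, answered by the
    set of direct predecessors of a target node."""
    rng = random.Random(seed)
    edges = [(rng.randrange(num_nodes), rng.randrange(num_nodes)) for _ in range(num_edges)]
    target = rng.randrange(num_nodes)
    gold = {src for src, dst in edges if dst == target}
    context = "\n".join(f"node{src} -> node{dst}" for src, dst in edges)
    prompt = f"{context}\n\nList every node that has a direct edge into node{target}."
    return prompt, gold

def set_f1(predicted: set, gold: set) -> float:
    """Score an answer as the F1 overlap between predicted and gold parent sets."""
    if not predicted or not gold:
        return 0.0
    p = len(predicted & gold) / len(predicted)
    r = len(predicted & gold) / len(gold)
    return 2 * p * r / (p + r) if (p + r) else 0.0

prompt, gold = build_parents_item()
print(len(prompt), "characters of context,", len(gold), "gold parents")
```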


What makes it different: long-context persistence + agentic capability

Two capabilities set MiMo-V2.5-Pro apart from competitors:

1. Long-context that actually works.
Hybrid Attention with the 6:1 SWA/GA ratio and MTP training together enable information to travel across 1M tokens. Most models lose context coherence well before 256K. V2.5-Pro was explicitly evaluated and validated on graph navigation tasks spanning a full million tokens, which require remembering relationships scattered across an arbitrarily long sequence.

2. Agentic reasoning via MOPD.
Post-training with Multi-Teacher On-Policy Distillation gives V2.5-Pro the ability to plan, execute multi-step operations (code, search, browse), and recover from errors autonomously. The agentic RL loop taught the model to maintain strategy across hundreds of action steps, which directly translates into the 57.2% SWE-Bench Pro score.
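The release does not publish the MOPD training code, but the core idea can be sketched as follows, assuming Hugging Face-style causal LMs for both the student and the teachers. The teacher weighting scheme and how this KL term combines with the agentic RL reward are assumptions, not confirmed details.

```python
import torch
import torch.nn.functional as F

def mopd_step(student, teachers, prompt_ids, weights=None, max_new_tokens=64):
    """One distillation step: the student samples its own continuation
    (on-policy), every teacher re-scores those exact tokens, and the student
    is pulled toward the weighted mixture of teacher distributions."""
    weights = weights or [1.0 / len(teachers)] * len(teachers)
    prompt_len = prompt_ids.shape[1]

    # 1) On-policy rollout sampled from the student itself.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Student log-probs for the sampled continuation (gradients flow here).
    student_logits = student(rollout).logits[:, prompt_len - 1:-1]
    student_logp = F.log_softmax(student_logits, dim=-1)

    # 3) Weighted mixture of teacher distributions over the same positions.
    with torch.no_grad():
        teacher_mix = sum(
            w * F.softmax(t(rollout).logits[:, prompt_len - 1:-1], dim=-1)
            for w, t in zip(weights, teachers))

    # 4) KL(teacher mixture || student), summed over positions, averaged over the batch.
    return F.kl_div(student_logp, teacher_mix, reduction="batchmean")
```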


Comparison with MiMo-V2.5 and MiMo-V2-Pro

| Dimension | MiMo-V2-Pro | MiMo-V2.5 | MiMo-V2.5-Pro |
|---|---|---|---|
| Total params | 1.02T | 1.02T | 1.02T |
| Active params | 42B | 42B | 42B |
| Architecture | MoE + Hybrid Attn | MoE + Hybrid Attn | MoE + Hybrid Attn + MTP |
| Layers | 69 | 69 | 70 (1 dense + 69 MoE) |
| Experts (total / routed) | 384 / 8 | 384 / 8 | 384 / 8 |
| Context | 128K | 512K | 1M |
| Pre-training tokens | 14T | 20T | 27T |
| Pre-training precision | FP16 | BF16 | FP8 → BF16 |
| Post-training | SFT | SFT + RL | SFT + Agentic RL + MOPD |
| SWE-Bench Pro | n/a | n/a | 57.2% |
| GraphWalks Parents @ 1M | 0.00 | 0.48 | 0.62 |
| License | MIT | MIT | MIT |

V2.5-Pro is the only model in the family with MTP training, the full 1M context validation, and the MOPD post-training pipeline.


Deployment

MiMo-V2.5-Pro can be deployed via two primary inference frameworks:

  • SGLang (latest) with EAGLE speculative decoding for accelerated inference
  • vLLM (latest) for high-throughput serving

An API is available at platform.xiaomimimo.com.

For local deployment with EAGLE (speculative decoding), SGLang provides significant speedups by using a smaller draft model to propose tokens, which the V2.5-Pro verifier accepts or rejects — typically achieving 2-3× throughput improvement without accuracy loss.
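Once the model is served, both frameworks expose an OpenAI-compatible endpoint, so a client call looks like the sketch below (it assumes the `openai` Python package; the port, base URL, and served model name are placeholders to match against your launch command).

```python
from openai import OpenAI

# Both SGLang and vLLM expose an OpenAI-compatible server once the weights are
# loaded (default ports: 30000 for SGLang, 8000 for vLLM). The base_url and
# model name below are assumptions; match them to your launch command.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="XiaoMi/MiMo-V2.5-Pro",
    messages=[{"role": "user", "content": "Walk through the failing test in this repo and propose a minimal patch."}],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```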


License and access

MiMo-V2.5-Pro is released under the MIT license — full commercial use, modification, and redistribution allowed with minimal restrictions.

The model is available at: XiaoMi/MiMo-V2.5-Pro


Citation

@misc{xiaomicom2026.mimov25pro,
  title={MiMo V2.5 Pro: Efficient Long-Context Language Model with Multi-Teacher On-Policy Distillation},
  author={XiaoMi},
  year={2026}
}
Tags:
  • AI
  • MiMo
  • LLM
  • MoE
  • Agentic
  • Open Source
  • Long Context
  • Coding
  • FP8