GLM-5: 744B MoE model, AIME 92.7%, SWE-bench 77.8%, 96.9% HMMT

What is GLM-5?

GLM-5 is a large language model released by Z.ai (智谱AI). It has 744 billion total parameters with only 40 billion active at inference — the same Mixture of Experts efficiency pattern that made DeepSeek-V3 practical to deploy at scale.

It is the direct successor to GLM-4.5 (355B/32B active) and GLM-4.7, with substantially more pre-training data (28.5T tokens vs 23T), a novel sparse attention mechanism, and a post-training infrastructure built specifically for long-horizon agentic tasks. The paper title gives the ambition away: GLM-5: from Vibe Coding to Agentic Engineering.

The benchmark numbers are frontier-level for an open-weight model: 92.7% on AIME 2026, 77.8% on SWE-bench Verified, and 96.9% on HMMT Nov 2025 — the best open-source score on that competition math benchmark.

Architecture and training

DeepSeek Sparse Attention (DSA) is integrated to reduce deployment cost while preserving long-context capacity. At 744B total parameters, hardware requirements are significant — but the 40B active parameter count keeps per-token compute at a manageable level.

The "slime" RL infrastructure is ZhipuAI's solution to training models on complex, multi-step tasks. Standard RLHF struggles with long-horizon tasks because reward signals are sparse. The asynchronous design decouples generation from optimization, allowing larger batch sizes and more stable training on multi-step agent tasks.

Benchmark results

Mathematics

Benchmark	GLM-5	GLM-4.7
AIME 2026 I	92.7%	—
HMMT Nov 2025	96.9%	—
HLE (no tools)	30.5	24.8
HLE (with tools)	50.4	—

HMMT (Harvard-MIT Mathematics Tournament) is a highly competitive undergraduate-level math tournament. 96.9% is the best open-source result on this benchmark.

Coding and software engineering

Benchmark	Score
SWE-bench Verified	77.8%
Terminal-Bench 2.0	56.2–61.1%

SWE-bench Verified measures the ability to resolve real GitHub issues on open-source codebases. At 77.8%, GLM-5 sits at the frontier for open models. Terminal-Bench scores are competitive with Claude Opus 4.5 on command-line engineering tasks.

Reasoning and knowledge

Benchmark	Score
GPQA-Diamond	86.0%
HLE (Humanity's Last Exam)	30.5
HLE with tools	50.4

GPQA-Diamond is a PhD-level expert reasoning benchmark; 86.0% puts GLM-5 among the top models available. HLE is the hardest general knowledge evaluation currently in use.

Cybersecurity

Benchmark	Score
CyberGym	43.2%

Context window and agentic use

GLM-5 supports up to 202,752 tokens in reasoning + tool use configurations — long enough to hold entire codebases, long reports, or multi-turn agent trajectories in context.

The model natively supports:

Tool calling via the GLM-4.7 parser with auto-tool-choice
Extended reasoning via the GLM-4.5 reasoning parser
Web browsing, terminal execution, and function calling

Deployment

An FP8 quantized version (zai-org/GLM-5-FP8) is available, reducing memory requirements significantly. Both vLLM and SGLang are supported, with speculative decoding enabled for higher throughput.

With vLLM:

docker pull vllm/vllm-openai:nightly

vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8

With SGLang:

python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3

Ascend NPU deployments are supported via KTransformers and xLLM.

Limitations

Languages: English and Chinese only — no multilingual coverage beyond these two
Scale requirements — even with FP8 quantization, 40B active parameters requires multi-GPU setups (8× tensor parallel in the examples)
No public API yet — self-hosted only for now
Context at 202K requires specific configuration — default evaluation context is 128K
License details not specified in the model card; verify before commercial deployment

Conclusion

GLM-5 enters the frontier open-source tier that only DeepSeek-V3 and a handful of others occupy. The 744B/40B MoE design keeps inference practical while delivering benchmark numbers — 96.9% HMMT, 77.8% SWE-bench, 86.0% GPQA-Diamond — that match or exceed many closed models.

For teams needing a self-hosted model for serious math, coding, or agentic workloads without depending on an API, GLM-5 is now the strongest option available.

Model: zai-org/GLM-5 · FP8 version

GLM-5: 744B parameters, 40B active — Z.ai's open-source frontier model