Gemma 4 31B: Google's multimodal model with 256K context and thinking mode
- Bastien
- 03 Apr, 2026
What is Gemma 4 31B?
Gemma 4 31B (instruction-tuned variant: gemma-4-31B-it) is Google’s latest open-weights multimodal model with 30.7 billion parameters. It processes text, images, and video, supports a 256K token context window, and ships under the Apache 2.0 license.
Compared to the Gemma 3 generation, Gemma 4 brings a significantly longer context window (up from 128K to 256K), native thinking mode, video input, and strong benchmark improvements across reasoning, coding, and vision tasks.
Architecture
Key architectural decisions:
- Hybrid attention — alternates between local sliding window attention (1024 tokens) and global full attention, balancing efficiency with long-range coherence.
- Proportional RoPE — position encoding adapted for very long contexts, helping the model maintain coherence over 256K tokens.
- Vision encoder — ~550M parameter encoder supporting variable aspect ratios and configurable token budgets (70 to 1120 tokens per image), making it useful for both quick captioning and dense OCR.
- Dense architecture — 30.7B parameters, not a Mixture of Experts model, which means simpler deployment and more predictable memory requirements.
Total: 33B parameters including embeddings.
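The alternation between local and global layers can be sketched with toy attention masks. This is a minimal illustration of the idea, scaled down from the 1024-token window described above to an 8-token sequence; the actual ratio of local to global layers isn't specified in this post:

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where position i attends only to the last `window` positions."""
    return [[j <= i and j > i - window for j in range(seq_len)]
            for i in range(seq_len)]

def full_causal_mask(seq_len):
    """Full causal mask: position i attends to every position <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Toy sequence of 8 tokens with a window of 4 (Gemma 4 uses 1024).
local = sliding_window_mask(8, 4)
global_ = full_causal_mask(8)

# Local layers bound per-token attention cost to the window size,
# while the interleaved global layers preserve long-range coherence.
print([sum(row) for row in local])    # [1, 2, 3, 4, 4, 4, 4, 4]
print([sum(row) for row in global_])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The local layers keep attention cost constant per token regardless of sequence length, which is what makes the 256K window tractable.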
Modalities and capabilities
| Modality | Support |

|---|---|
| Text input / output | Yes |
| Images | Variable aspect ratio, up to 1120 tokens/image |
| Video | Up to 60 seconds at 1 fps |
| Audio | No (available on E2B/E4B variants only) |

| Capability | Details |
|---|---|
| Thinking mode | Configurable step-by-step reasoning via enable_thinking=True |
| Function calling | Native structured tool use |
| Multilingual | 140+ languages |
| Document/PDF parsing | OCR, handwriting recognition |
| System prompts | Native system role support |
| Context window | 256K tokens |
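Native function calling follows the JSON-schema tool convention used by the transformers chat template. Here is a sketch of what a tool definition looks like; `get_weather` and its parameters are illustrative examples, not anything built into Gemma 4:

```python
import json

# A hypothetical tool in the JSON-schema style accepted by the
# transformers chat template's `tools=` argument.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The schema would be passed alongside the messages, e.g.:
# processor.apply_chat_template(messages, tools=[get_weather], ...)
print(json.dumps(get_weather, indent=2))
```

The model then emits a structured call naming the function and its arguments, which your code executes before feeding the result back as a tool message.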
Benchmark results
| Benchmark | Score |
|---|---|
| MMLU Pro | 85.2% |
| AIME 2026 | 89.2% |
| LiveCodeBench v6 | 80.0% |
| Codeforces ELO | 2150 |
| GPQA Diamond | 84.3% |
| MMMU Pro (Vision) | 76.9% |
| MATH-Vision | 85.6% |
| Long Context (128K) | 66.4% |
AIME 2026 at 89.2% and Codeforces ELO at 2150 are particularly notable for an open-weights model at this size.
Usage
Install dependencies:
```
pip install -U transformers torch accelerate
```
Basic text generation:
```python
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between RAG and fine-tuning."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)

# Keep special tokens so parse_response can separate thinking from the answer
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```
With thinking mode enabled:
```python
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # step-by-step reasoning before answering
)
```
Image understanding:
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```
Recommended sampling settings:
```python
temperature = 1.0
top_p = 0.95
top_k = 64
```
Image token budgets
The vision encoder supports configurable token budgets per image, letting you trade off quality against speed:
| Budget | Recommended use |
|---|---|
| 70–140 tokens | Quick classification, general captioning |
| 280 tokens | Standard image understanding |
| 560 tokens | Dense scenes, small text |
| 1120 tokens | OCR, document parsing, fine detail |
For video, lower budgets (70–280) are recommended given the number of frames.
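Some back-of-envelope math shows why. The sketch below treats each frame like a single image at the budgets above; the per-frame figures come from the table, 256K is taken loosely as 256,000 tokens, and the exact video tokenization scheme isn't documented here:

```python
CONTEXT_WINDOW = 256_000  # 256K taken as a round number

def video_tokens(duration_s, fps=1, tokens_per_frame=280):
    """Rough token cost of a clip, assuming each frame is tokenized like an image."""
    return duration_s * fps * tokens_per_frame

# A maximum-length 60-second clip at 1 fps:
print(video_tokens(60, tokens_per_frame=70))    # 4200
print(video_tokens(60, tokens_per_frame=280))   # 16800 (~6.6% of the window)
print(video_tokens(60, tokens_per_frame=1120))  # 67200 -- over a quarter of the window
```

At the OCR-grade budget a single clip would consume more than 26% of the context, which is why the lower tiers are the sensible default for video.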
Deployment
26 quantized variants are available on the Hub for llama.cpp, LM Studio, Jan, and Ollama.
Install for video support:
```
pip install -U transformers torch torchvision torchcodec librosa accelerate
```
Limitations
- Factual accuracy — may produce incorrect or outdated information; knowledge cutoff January 2025
- No audio input — audio is only available on E2B/E4B model variants
- Long-context degradation — long context score drops at 128K+ tokens (66.4%); very long documents may see coherence issues
- Language nuance — figurative language, sarcasm, and cultural references may be handled inconsistently
Conclusion
Gemma 4 31B is a significant step forward for open-weights multimodal models. The combination of a 256K context window, thinking mode, native function calling, and strong coding/reasoning benchmarks — under Apache 2.0 — makes it one of the most capable open models available at this size.
The variable image resolution system is a practical addition: you can tune token budgets to match the task, keeping inference fast for simple captioning and switching to high fidelity for OCR or document parsing.
Model: google/gemma-4-31B-it
Tags:
- AI
- Gemma
- Multimodal
- Vision
- Open Source