Gemma 4 31B: Google's multimodal model with 256K context and thinking mode
- Bastien
- 03 Apr, 2026
What is Gemma 4 31B?
Gemma 4 31B (instruction-tuned variant: gemma-4-31B-it) is Google’s latest open-weights multimodal model with 30.7 billion parameters. It processes text, images, and video, supports a 256K token context window, and ships under the Apache 2.0 license.
Compared to the Gemma 3 generation, Gemma 4 brings a significantly longer context window (up from 128K to 256K), native thinking mode, video input, and strong benchmark improvements across reasoning, coding, and vision tasks.
Architecture
Key architectural decisions:
- Hybrid attention — alternates between local sliding window attention (1024 tokens) and global full attention, balancing efficiency with long-range coherence.
- Proportional RoPE — position encoding adapted for very long contexts, helping the model maintain coherence over 256K tokens.
- Vision encoder — ~550M parameter encoder supporting variable aspect ratios and configurable token budgets (70 to 1120 tokens per image), making it useful for both quick captioning and dense OCR.
- Dense architecture — 30.7B parameters, not a Mixture of Experts model, which means simpler deployment and more predictable memory requirements.
Total: 33B parameters including embeddings.
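The alternation between local and global layers can be sketched with toy attention masks. This is a minimal illustration of the idea, scaled down from the 1024-token window described above to an 8-token sequence; the actual ratio of local to global layers isn't specified in this post:

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where position i attends only to the last `window` positions."""
    return [[j <= i and j > i - window for j in range(seq_len)]
            for i in range(seq_len)]

def full_causal_mask(seq_len):
    """Full causal mask: position i attends to every position <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Toy sequence of 8 tokens with a window of 4 (Gemma 4 uses 1024).
local = sliding_window_mask(8, 4)
global_ = full_causal_mask(8)

# Local layers bound per-token attention cost to the window size,
# while the interleaved global layers preserve long-range coherence.
print([sum(row) for row in local])    # [1, 2, 3, 4, 4, 4, 4, 4]
print([sum(row) for row in global_])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The local layers keep attention cost constant per token regardless of sequence length, which is what makes the 256K window tractable.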
Modalities and capabilities
| Modality | Support |

|---|---|
| Text input / output | Yes |
| Images | Variable aspect ratio, up to 1120 tokens/image |
| Video | Up to 60 seconds at 1 fps |
| Audio | No (available on E2B/E4B variants only) |

| Capability | Details |
|---|---|
| Thinking mode | Configurable step-by-step reasoning via enable_thinking=True |
| Function calling | Native structured tool use |
| Multilingual | 140+ languages |
| Document/PDF parsing | OCR, handwriting recognition |
| System prompts | Native system role support |
| Context window | 256K tokens |
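Native function calling follows the JSON-schema tool convention used by the transformers chat template. Here is a sketch of what a tool definition looks like; `get_weather` and its parameters are illustrative examples, not anything built into Gemma 4:

```python
import json

# A hypothetical tool in the JSON-schema style accepted by the
# transformers chat template's `tools=` argument.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The schema would be passed alongside the messages, e.g.:
# processor.apply_chat_template(messages, tools=[get_weather], ...)
print(json.dumps(get_weather, indent=2))
```

The model then emits a structured call naming the function and its arguments, which your code executes before feeding the result back as a tool message.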
Benchmark results
| Benchmark | Score |
|---|---|
| MMLU Pro | 85.2% |
| AIME 2026 | 89.2% |
| LiveCodeBench v6 | 80.0% |
| Codeforces ELO | 2150 |
| GPQA Diamond | 84.3% |
| MMMU Pro (Vision) | 76.9% |
| MATH-Vision | 85.6% |
| Long Context (128K) | 66.4% |
AIME 2026 at 89.2% and Codeforces ELO at 2150 are particularly notable for an open-weights model at this size.
Usage
Install dependencies:
```
pip install -U transformers torch accelerate
```
Basic text generation:
```python
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between RAG and fine-tuning."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)

# Keep special tokens so parse_response can separate thinking from the answer
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```
With thinking mode enabled:
```python
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # step-by-step reasoning before answering
)
```
Image understanding:
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```
Recommended sampling settings:
```python
temperature = 1.0
top_p = 0.95
top_k = 64
```
Image token budgets
The vision encoder supports configurable token budgets per image, letting you trade off quality against speed:
| Budget | Recommended use |
|---|---|
| 70–140 tokens | Quick classification, general captioning |
| 280 tokens | Standard image understanding |
| 560 tokens | Dense scenes, small text |
| 1120 tokens | OCR, document parsing, fine detail |
For video, lower budgets (70–280) are recommended given the number of frames.
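Some back-of-envelope math shows why. The sketch below treats each frame like a single image at the budgets above; the per-frame figures come from the table, 256K is taken loosely as 256,000 tokens, and the exact video tokenization scheme isn't documented here:

```python
CONTEXT_WINDOW = 256_000  # 256K taken as a round number

def video_tokens(duration_s, fps=1, tokens_per_frame=280):
    """Rough token cost of a clip, assuming each frame is tokenized like an image."""
    return duration_s * fps * tokens_per_frame

# A maximum-length 60-second clip at 1 fps:
print(video_tokens(60, tokens_per_frame=70))    # 4200
print(video_tokens(60, tokens_per_frame=280))   # 16800 (~6.6% of the window)
print(video_tokens(60, tokens_per_frame=1120))  # 67200 -- over a quarter of the window
```

At the OCR-grade budget a single clip would consume more than 26% of the context, which is why the lower tiers are the sensible default for video.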
Deployment
26 quantized variants are available on the Hub for llama.cpp, LM Studio, Jan, and Ollama.
Install for video support:
```
pip install -U transformers torch torchvision torchcodec librosa accelerate
```
Limitations
- Factual accuracy — may produce incorrect or outdated information; knowledge cutoff January 2025
- No audio input — audio is only available on E2B/E4B model variants
- Long-context degradation — long context score drops at 128K+ tokens (66.4%); very long documents may see coherence issues
- Language nuance — figurative language, sarcasm, and cultural references may be handled inconsistently
Conclusion
Gemma 4 31B is a significant step forward for open-weights multimodal models. The combination of a 256K context window, thinking mode, native function calling, and strong coding/reasoning benchmarks — under Apache 2.0 — makes it one of the most capable open models available at this size.
The variable image resolution system is a practical addition: you can tune token budgets to match the task, keeping inference fast for simple captioning and switching to high fidelity for OCR or document parsing.
Model: google/gemma-4-31B-it
Tags:
- AI
- Gemma
- Multimodal
- Vision
- Open Source