Gemma 4 31B: Google's multimodal model with 256K context and thinking mode

Google's Gemma 4 31B is a dense 30.7B-parameter multimodal model supporting text, images, and video, with a 256K context window, native thinking mode, function calling, and 140+ languages, released under Apache 2.0.

What is Gemma 4 31B?

Gemma 4 31B (instruction-tuned variant: gemma-4-31B-it) is Google's latest open-weights multimodal model with 30.7 billion parameters. It processes text, images, and video, supports a 256K token context window, and ships under the Apache 2.0 license.

Compared to the Gemma 3 generation, Gemma 4 brings a significantly longer context window (up from 128K to 256K), native thinking mode, video input, and strong benchmark improvements across reasoning, coding, and vision tasks.


Architecture

Key architectural decisions:

  • Hybrid attention: alternates between local sliding window attention (1024 tokens) and global full attention, balancing efficiency with long-range coherence.
  • Proportional RoPE: position encoding adapted for very long contexts, helping the model maintain coherence over 256K tokens.
  • Vision encoder: ~550M parameter encoder supporting variable aspect ratios and configurable token budgets (70 to 1120 tokens per image), making it useful for both quick captioning and dense OCR.
  • Dense architecture: 30.7B parameters, not a Mixture of Experts model, which means simpler deployment and more predictable memory requirements.

Total: 33B parameters including embeddings.
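
To make the hybrid attention trade-off concrete, here is a minimal sketch that counts how many tokens a single query can attend to in a local layer versus a global layer. It is an illustration under assumptions, not the model's implementation: it assumes causal attention, a window that includes the current token, and that "256K" means 262,144 tokens. The function names are hypothetical.

```python
# Sketch only: counts the keys visible to one query under causal
# sliding-window attention versus causal full attention.

LOCAL_WINDOW = 1024          # sliding-window size stated in the article
CONTEXT_TOKENS = 256 * 1024  # assumption: "256K" means 262,144 tokens


def visible_keys_local(position: int, window: int = LOCAL_WINDOW) -> int:
    """Keys visible to the query at 0-based `position` in a local layer.

    Assumes the window includes the current token.
    """
    return min(position + 1, window)


def visible_keys_global(position: int) -> int:
    """Keys visible to the query at 0-based `position` in a global layer."""
    return position + 1


last = CONTEXT_TOKENS - 1
print("local layer :", visible_keys_local(last))
print("global layer:", visible_keys_global(last))
```

Under these assumptions, the per-query cost of a local layer stops growing once the sequence exceeds the window, while a global layer keeps growing with sequence length. That is the efficiency side of the trade-off described above.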


Modalities and capabilities

| Modality | Support |
| --- | --- |
| Text input / output | Yes |
| Images | Variable aspect ratio, up to 1120 tokens/image |
| Video | Up to 60 seconds at 1 fps |
| Audio | No (available on E2B/E4B variants only) |

| Capability | Details |
| --- | --- |
| Thinking mode | Configurable step-by-step reasoning via enable_thinking=True |
| Function calling | Native structured tool use |
| Multilingual | 140+ languages |
| Document/PDF parsing | OCR, handwriting recognition |
| System prompts | Native system role support |
| Context window | 256K tokens |
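
Function calling generally has two application-side pieces: a schema that describes each tool to the model, and a dispatcher that runs the call the model asks for. The sketch below shows both in plain Python. Everything in it is illustrative: `get_weather`, the JSON-schema layout, and the `{"name", "arguments"}` shape of the parsed call are assumptions, and the exact format Gemma 4 emits may differ.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny"}


# JSON-schema description of the tool (a commonly used layout, assumed here).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return a short forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

# Registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call: dict) -> str:
    """Run one parsed tool call and return its result as a JSON string."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    result = TOOLS[name](**tool_call["arguments"])
    return json.dumps(result)


# Assumed shape of a tool call after parsing the model's output.
call = {"name": "get_weather", "arguments": {"city": "Lisbon"}}
print(dispatch(call))
```

The JSON string returned by `dispatch` would then be sent back to the model as the tool result in the next turn.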

Benchmark results

| Benchmark | Score |
| --- | --- |
| MMLU Pro | 85.2% |
| AIME 2026 | 89.2% |
| LiveCodeBench v6 | 80.0% |
| Codeforces ELO | 2150 |
| GPQA Diamond | 84.3% |
| MMMU Pro (Vision) | 76.9% |
| MATH-Vision | 85.6% |
| Long Context (128K) | 66.4% |

AIME 2026 at 89.2% and Codeforces ELO at 2150 are particularly notable for an open-weights model at this size.


Usage

Install dependencies:

pip install -U transformers torch accelerate

Basic text generation:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between RAG and fine-tuning."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

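outputs = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
# Parse the raw decoded text and print the result.
print(processor.parse_response(response))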

With thinking mode enabled:

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Step-by-step reasoning before answering
)

Image understanding:

from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What trend does this chart show?"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

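outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
# Parse the raw decoded text and print the result.
print(processor.parse_response(response))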

Recommended sampling settings:

temperature = 1.0
top_p = 0.95
top_k = 64
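
One way to apply these settings is to pass them as keyword arguments to `generate`. The sketch below assumes sampling has to be switched on explicitly with `do_sample=True`; the checkpoint's own generation config may already set some of these values. The helper name `generate_sampled` is hypothetical.

```python
# Recommended sampling settings from above, as generate() keyword arguments.
SAMPLING_KWARGS = {
    "do_sample": True,   # assumption: sampling is not already enabled by default
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}


def generate_sampled(model, inputs, max_new_tokens=1024):
    """Call model.generate with the recommended sampling settings.

    `model` is any object with a Hugging Face style generate() method and
    `inputs` is the mapping of tensors returned by the processor.
    """
    return model.generate(**inputs, max_new_tokens=max_new_tokens, **SAMPLING_KWARGS)
```

With the objects from the earlier examples, usage would be `outputs = generate_sampled(model, inputs)`.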

Image token budgets

The vision encoder supports configurable token budgets per image, letting you trade off quality against speed:

Budget Recommended use
70–140 tokens Quick classification, general captioning
280 tokens Standard image understanding
560 tokens Dense scenes, small text
1120 tokens OCR, document parsing, fine detail

For video, lower budgets (70–280) are recommended given the number of frames.
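
A quick calculation shows why lower budgets matter for video. The sketch below assumes each sampled frame costs the same number of tokens as a still image at the chosen budget, and that the budgets form the discrete set listed in the table. Both are assumptions for illustration, and `video_visual_tokens` is a hypothetical helper, not part of any library.

```python
IMAGE_BUDGETS = (70, 140, 280, 560, 1120)  # assumption: discrete budget values
MAX_VIDEO_SECONDS = 60                     # video limit stated in the article
VIDEO_FPS = 1                              # frame sampling rate stated in the article


def video_visual_tokens(seconds: int, budget: int, fps: int = VIDEO_FPS) -> int:
    """Estimate visual tokens for a clip: frames sampled times tokens per frame."""
    if budget not in IMAGE_BUDGETS:
        raise ValueError(f"unsupported budget: {budget}")
    if not 0 < seconds <= MAX_VIDEO_SECONDS:
        raise ValueError("clip length must be between 1 and 60 seconds")
    frames = seconds * fps
    return frames * budget


for budget in IMAGE_BUDGETS:
    print(f"{budget:>4} tokens/frame -> {video_visual_tokens(60, budget):>6} tokens")
```

Under these assumptions, a full 60-second clip costs 4,200 visual tokens at a budget of 70 and 67,200 at a budget of 1120, about a quarter of the 256K context window.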


Deployment

26 quantized variants are available on the Hub for llama.cpp, LM Studio, Jan, and Ollama.
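
Because the model is dense, a rough weight-memory estimate is just parameter count times bits per parameter. The sketch below uses the 30.7B figure from the architecture section and ignores the KV cache, activations, and the per-block overhead that real quantization formats add, so treat the results as lower bounds. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantization metadata.
    """
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 30.7e9  # dense parameter count stated in the article

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

By this estimate the unquantized 16-bit weights need roughly 61 GB, which is why the quantized variants matter for single-GPU and desktop deployment.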

Install for video support:

pip install -U transformers torch torchvision torchcodec librosa accelerate

Limitations

  • Factual accuracy: may produce incorrect or outdated information; knowledge cutoff is January 2025
  • No audio input: audio is only available on the E2B/E4B model variants
  • Long-context degradation: the model scores 66.4% on the 128K long-context benchmark, so very long documents may see coherence issues
  • Language nuance: figurative language, sarcasm, and cultural references may be handled inconsistently

Conclusion

Gemma 4 31B is a significant step forward for open-weights multimodal models. The combination of a 256K context window, thinking mode, native function calling, and strong coding/reasoning benchmarks, all under Apache 2.0, makes it one of the most capable open models available at this size.

The variable image resolution system is a practical addition: you can tune token budgets to match the task, keeping inference fast for simple captioning and switching to high fidelity for OCR or document parsing.

Model: google/gemma-4-31B-it