Gemma 4 31B: Google's multimodal model with 256K context and thinking mode

What is Gemma 4 31B?

Gemma 4 31B (instruction-tuned variant: gemma-4-31B-it) is Google’s latest open-weights multimodal model with 30.7 billion parameters. It processes text, images, and video, supports a 256K token context window, and ships under the Apache 2.0 license.

Compared to the Gemma 3 generation, Gemma 4 doubles the context window from 128K to 256K tokens, adds a native thinking mode and video input, and posts strong benchmark improvements across reasoning, coding, and vision tasks.


Architecture

Key architectural decisions:

  • Hybrid attention — alternates between local sliding window attention (1024 tokens) and global full attention, balancing efficiency with long-range coherence.
  • Proportional RoPE — position encoding adapted for very long contexts, helping the model maintain coherence over 256K tokens.
  • Vision encoder — ~550M parameter encoder supporting variable aspect ratios and configurable token budgets (70 to 1120 tokens per image), making it useful for both quick captioning and dense OCR.
  • Dense architecture — 30.7B parameters, not a Mixture of Experts model, which means simpler deployment and more predictable memory requirements.

Total: 33B parameters including embeddings.
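
The hybrid attention pattern can be sketched in plain Python: a minimal illustration of which key positions a query token can see, assuming causal masking and the 1024-token window for local layers (the exact ratio of local to global layers is not stated here, so the sketch only contrasts the two layer types):

```python
def can_attend(query_pos: int, key_pos: int, layer: str, window: int = 1024) -> bool:
    """Return True if a query token may attend to a key token.

    layer: "local"  -> causal sliding-window attention (last `window` tokens)
           "global" -> causal full attention over the whole prefix
    """
    if key_pos > query_pos:  # causal mask: never attend to future tokens
        return False
    if layer == "global":
        return True
    # local layer: only the most recent `window` positions are visible
    return query_pos - key_pos < window

# At position 100,000 a local layer sees exactly 1024 keys, while a
# global layer at position 2,000 sees the entire 2001-token prefix.
visible_local = sum(can_attend(100_000, k, "local") for k in range(100_001))
visible_global = sum(can_attend(2_000, k, "global") for k in range(2_001))
```

The local layers keep attention cost constant per token, and the interleaved global layers restore long-range coherence, which is what makes a 256K window tractable.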


Modalities and capabilities

| Modality | Support |
| --- | --- |
| Text input / output | Yes |
| Images | Variable aspect ratio, up to 1120 tokens/image |
| Video | Up to 60 seconds at 1 fps |
| Audio | No (available on E2B/E4B variants only) |

| Capability | Details |
| --- | --- |
| Thinking mode | Configurable step-by-step reasoning via enable_thinking=True |
| Function calling | Native structured tool use |
| Multilingual | 140+ languages |
| Document/PDF parsing | OCR, handwriting recognition |
| System prompts | Native system role support |
| Context window | 256K tokens |

Benchmark results

| Benchmark | Score |
| --- | --- |
| MMLU Pro | 85.2% |
| AIME 2026 | 89.2% |
| LiveCodeBench v6 | 80.0% |
| Codeforces ELO | 2150 |
| GPQA Diamond | 84.3% |
| MMMU Pro (Vision) | 76.9% |
| MATH-Vision | 85.6% |
| Long Context (128K) | 66.4% |

AIME 2026 at 89.2% and Codeforces ELO at 2150 are particularly notable for an open-weights model at this size.


Usage

Install dependencies:

pip install -U transformers torch accelerate

Basic text generation:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between RAG and fine-tuning."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

With thinking mode enabled:

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Step-by-step reasoning before answering
)
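
When thinking mode is on, the raw completion contains the model's reasoning followed by the final answer, which processor.parse_response separates for you. To make the idea concrete, here is a stdlib-only sketch of that split; the `<think>...</think>` delimiters are an assumption for illustration (several open models use this convention), so check the model card for the actual markers:

```python
import re

def split_thinking(text: str, open_tag: str = "<think>", close_tag: str = "</think>") -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    The tag names are an assumption for illustration; the real delimiters
    are whatever the chat template emits (parse_response handles them).
    """
    match = re.search(re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), text, flags=re.DOTALL)
    if not match:
        # No thinking block: the whole completion is the answer
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>Compare the two approaches step by step.</think>Use RAG for freshness."
)
```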

Image understanding:

from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForMultimodalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What trend does this chart show?"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))

Recommended sampling settings:

temperature = 1.0
top_p = 0.95
top_k = 64
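
To see what those knobs actually do, here is a stdlib-only sketch of top-k followed by top-p (nucleus) filtering over a toy next-token distribution. This is an illustration of the sampling math, not the model's actual sampler:

```python
import math

def top_k_top_p_filter(logits: dict[str, float], top_k: int = 64, top_p: float = 0.95) -> dict[str, float]:
    """Keep the top_k highest-logit tokens, then the smallest set whose
    cumulative probability reaches top_p; renormalize the survivors."""
    # top-k: keep only the k highest logits
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # softmax over the survivors (shifted by the max logit for stability)
    z = max(v for _, v in kept)
    exps = {t: math.exp(v - z) for t, v in kept}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # top-p: take tokens in descending probability until mass >= top_p
    nucleus, mass = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[t] = p
        mass += p
        if mass >= top_p:
            break
    total = sum(nucleus.values())
    return {t: p / total for t, p in nucleus.items()}

# "zzz" is cut by top_k=3; "cat" falls outside the 0.95 nucleus.
dist = top_k_top_p_filter({"the": 5.0, "a": 4.0, "cat": 1.0, "zzz": -3.0}, top_k=3, top_p=0.95)
```

With temperature = 1.0 the logits are used as-is; lower temperatures would sharpen the distribution before this filtering step.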

Image token budgets

The vision encoder supports configurable token budgets per image, letting you trade off quality against speed:

| Budget | Recommended use |
| --- | --- |
| 70–140 tokens | Quick classification, general captioning |
| 280 tokens | Standard image understanding |
| 560 tokens | Dense scenes, small text |
| 1120 tokens | OCR, document parsing, fine detail |

For video, lower budgets (70–280) are recommended given the number of frames.
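
The recommendation follows from simple arithmetic: at 1 fps, a 60-second clip contributes 60 frames, so the per-frame budget multiplies quickly against the 256K context window:

```python
def video_vision_tokens(duration_s: int, fps: int = 1, tokens_per_frame: int = 280) -> int:
    """Total vision tokens a clip contributes to the context window."""
    return duration_s * fps * tokens_per_frame

# A 60 s clip at the maximum 1120-token budget costs 67,200 tokens,
# over a quarter of the 256K window; at 70 tokens/frame it costs 4,200.
high = video_vision_tokens(60, tokens_per_frame=1120)  # 67200
low = video_vision_tokens(60, tokens_per_frame=70)     # 4200
```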


Deployment

26 quantized variants are available on the Hub for llama.cpp, LM Studio, Jan, and Ollama.

Install for video support:

pip install -U transformers torch torchvision torchcodec librosa accelerate
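
Video requests follow the same chat-template shape as the image example above, with a video content part instead of an image. The URL below is a placeholder, and the exact content keys should be verified against the model card; this sketch only shows the expected message structure:

```python
# Hypothetical clip URL; the structure mirrors the image example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://example.com/clip.mp4"},
            {"type": "text", "text": "Summarize what happens in this clip."},
        ],
    }
]
# Pass `messages` to processor.apply_chat_template(...) exactly as in the
# image example, keeping per-frame token budgets low (70-280 tokens).
```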

Limitations

  • Factual accuracy — may produce incorrect or outdated information; knowledge cutoff January 2025
  • No audio input — audio is only available on E2B/E4B model variants
  • Long-context degradation — long context score drops at 128K+ tokens (66.4%); very long documents may see coherence issues
  • Language nuance — figurative language, sarcasm, and cultural references may be handled inconsistently

Conclusion

Gemma 4 31B is a significant step forward for open-weights multimodal models. The combination of a 256K context window, thinking mode, native function calling, and strong coding/reasoning benchmarks — under Apache 2.0 — makes it one of the most capable open models available at this size.

The variable image resolution system is a practical addition: you can tune token budgets to match the task, keeping inference fast for simple captioning and switching to high fidelity for OCR or document parsing.

Model: google/gemma-4-31B-it
