Gemma 4: Multimodal Models by Google DeepMind

Gemma 4 is a family of open multimodal models from Google DeepMind, offering both dense and Mixture-of-Experts (MoE) architectures. This release includes pre-trained and instruction-tuned variants, with the 31B model being a dense instruction-tuned version. These models are designed for text and image input, generating text output, while smaller E2B and E4B models also natively support audio.

Key Capabilities & Advancements

Multimodality: Processes text, image (with variable aspect ratio and resolution), and video across all models. E2B and E4B models additionally support audio.
Reasoning: Features configurable thinking modes for step-by-step reasoning.
Extended Context Window: Supports up to 256K tokens for medium models (including 31B) and 128K for smaller models.
Enhanced Coding & Agentic Capabilities: Demonstrates significant improvements in coding benchmarks and includes native function-calling support for autonomous agents.
Native System Prompt Support: Integrates a system role for more structured and controllable conversations.
Multilingual Support: Pre-trained on over 140 languages with out-of-the-box support for 35+ languages.

Performance Highlights

The Gemma 4 31B model shows strong performance across various benchmarks, including:

MMLU Pro: 85.2%
AIME 2026 no tools: 89.2%
LiveCodeBench v6: 80.0%
GPQA Diamond: 84.3%
MMMU Pro (Vision): 76.9%
MATH-Vision: 85.6%
Long Context (MRCR v2 8 needle 128k): 66.4%

Intended Usage

Gemma 4 models are well-suited for a wide range of applications:

Content Creation: Text generation, chatbots, summarization, image data extraction.
Research & Education: NLP and VLM research, language learning tools, knowledge exploration.
Agentic Workflows: Leveraging function calling for structured tool use.
Coding: Code generation, completion, and correction.

Overview

Gemma 4: Multimodal Models by Google DeepMind

Key Capabilities & Advancements

Performance Highlights

Intended Usage

Full Model Card (README)