google/gemma-4-26B-A4B

Text Generation · Concurrency Cost: 2 · Model Size: 26B · Quant: FP8 · Ctx Length: 32K · Published: Mar 12, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Gemma-4-26B-A4B is a multimodal Mixture-of-Experts (MoE) model from Google DeepMind, part of the Gemma 4 family, with 25.2 billion total parameters. It uses 3.8 billion active parameters for fast inference and supports a 256K-token context window. The model excels at reasoning, coding, and multimodal understanding, taking text, image, and video inputs and generating text outputs.


Gemma 4 26B A4B: Multimodal MoE for Reasoning and Coding

Google DeepMind's Gemma 4 26B A4B is a Mixture-of-Experts (MoE) model with 25.2 billion total parameters, designed for efficient, powerful multimodal AI. Only 3.8 billion parameters are active during inference, so it runs almost as fast as a 4B-parameter model while delivering performance comparable to much larger models. The model supports a substantial 256K-token context window and is proficient in over 140 languages.
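Since the Quant and Ctx Length values above describe a hosted deployment, the usual way to use the model is through an inference API. Below is a minimal sketch assuming an OpenAI-compatible chat-completions endpoint; the base_url, API key, and prompt are illustrative placeholders, not values from this card.

```python
# Minimal sketch: calling google/gemma-4-26B-A4B through an
# OpenAI-compatible chat-completions endpoint. Assumptions: your
# provider exposes such an endpoint; base_url and api_key are
# placeholders to replace with your provider's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B",
    messages=[
        # Gemma 4 has native system prompt support, so a system role is usable directly.
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the trade-offs of MoE models in two sentences."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```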

Key Capabilities

  • Multimodal Understanding: Processes text, images (with variable aspect ratio and resolution), and video inputs, generating text outputs. It supports interleaved multimodal input (see the image-input sketch after this list).
  • Reasoning: Features configurable thinking modes for step-by-step problem-solving.
  • Coding & Agentic Workflows: Achieves significant improvements in coding benchmarks and includes native function-calling support for autonomous agents.
  • Native System Prompt Support: Enhances structured and controllable conversations.
  • Long Context: Handles contexts up to 256K tokens.
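To make the interleaved multimodal input concrete, here is a minimal sketch using the OpenAI-compatible content-parts format. Whether a given provider exposes this model's image input this way is an assumption; the endpoint, key, and image URL are placeholders.

```python
# Minimal sketch: interleaved text + image input using the
# OpenAI-compatible content-parts format. Assumption: the serving
# provider accepts image_url parts for this model. The base_url,
# api_key, and image URL are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this diagram show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/diagram.png"},  # placeholder image
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```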

Benchmark Highlights

  • MMLU Pro: 82.6%
  • AIME 2026 (no tools): 88.3%
  • LiveCodeBench v6: 77.1%
  • MMMU Pro (Vision): 73.8%

Good For

  • Reasoning and complex problem-solving.
  • Code generation, completion, and correction.
  • Agentic workflows requiring structured tool use (see the function-calling sketch after this list).
  • Applications requiring multimodal input (text, image, video) for text generation.
  • Deployments where fast inference is crucial, balancing performance with efficiency.
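As a sketch of the structured tool use mentioned above, the following uses the OpenAI-compatible tools parameter. The get_weather tool, endpoint, and key are hypothetical, and the exact function-calling schema supported for this model depends on the provider.

```python
# Minimal sketch: native function calling via the OpenAI-compatible
# "tools" parameter. Assumptions: the provider maps this onto Gemma 4's
# function-calling support; get_weather is a hypothetical tool defined
# here purely for illustration.
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B",
    messages=[{"role": "user", "content": "What's the weather in Zurich right now?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as JSON text
# rather than free-form prose.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```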