google/gemma-4-E2B

Text generation · Model size: 2B · Quant: BF16 · Context length: 32K · Published: Mar 2, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

Gemma 4 E2B is a 2.3 billion effective parameter multimodal model developed by Google DeepMind, part of the Gemma 4 family. It supports text, image, and audio inputs with a 128K token context window. Optimized for on-device deployment, it excels in reasoning, coding, and agentic workflows, offering native function-calling and multilingual support across 140+ languages.


Gemma 4 E2B: A Multimodal Model for On-Device AI

Gemma 4 E2B is a 2.3 billion effective parameter model from Google DeepMind's Gemma 4 family, designed for efficient on-device deployment. This model is multimodal, capable of processing text, image, and audio inputs, and generating text outputs. It features a 128K token context window and supports over 140 languages.
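To give a feel for what a 128K-token window allows, here is a minimal Python sketch that estimates whether a prompt fits the window. The ~4-characters-per-token ratio is a crude heuristic, not the real Gemma tokenizer; use the model's own tokenizer for exact counts.

```python
# Rough token-budget check for a 128K-token context window.
# NOTE: CHARS_PER_TOKEN is a crude heuristic (assumption), not the
# actual Gemma tokenizer ratio.

CONTEXT_WINDOW = 128_000   # tokens, per the model card
CHARS_PER_TOKEN = 4        # rough average for English text

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(text: str, reserved_for_output: int = 2_000) -> bool:
    """True if the prompt plus a reserved output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW

# A ~400-page book (~520K characters) is roughly 130K tokens and
# would need to be trimmed or chunked before prompting.
book = "x" * 520_000
print(estimate_tokens(book))   # 130000
print(fits_in_context(book))   # False
```

In practice you would replace `estimate_tokens` with a call to the model's tokenizer and keep the output reservation, since generated tokens also consume the window.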

Key Capabilities

  • Multimodality: Natively processes text, images (with variable aspect ratio and resolution), and audio. It can also analyze video by processing sequences of frames.
  • Reasoning: Includes configurable thinking modes for step-by-step problem-solving.
  • Efficient Architecture: Utilizes Per-Layer Embeddings (PLE) to maximize parameter efficiency, making it suitable for high-end phones, laptops, and other edge devices.
  • Enhanced Coding & Agentic Capabilities: Shows significant improvements in coding benchmarks and offers native function-calling support for building autonomous agents.
  • Long Context: Supports a 128K token context window, enabling processing of extensive inputs.
  • Multilingual Support: Pre-trained on over 140 languages with out-of-the-box support for 35+ languages.
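Multimodal inputs like those above are usually passed as structured chat messages with typed content parts. The sketch below builds such a message list in the style used by transformers chat templates; the exact part schema accepted by this model's processor is an assumption.

```python
# Build a multimodal chat turn in the typed-content-parts style used by
# transformers chat templates. The exact keys Gemma 4 E2B's processor
# accepts are an assumption here.

def user_message(text, image_path=None, audio_path=None):
    """Assemble one user turn with optional image and audio parts."""
    parts = []
    if image_path:
        parts.append({"type": "image", "path": image_path})
    if audio_path:
        parts.append({"type": "audio", "path": audio_path})
    parts.append({"type": "text", "text": text})
    return {"role": "user", "content": parts}

messages = [
    {"role": "system",
     "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    user_message("Describe this photo and transcribe the clip.",
                 image_path="photo.jpg", audio_path="clip.wav"),
]

# In a real pipeline these messages would be rendered with the model's
# chat template (e.g. processor.apply_chat_template) before generation.
print(messages[1]["content"][0]["type"])  # image
```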

Good For

  • On-Device Applications: Ideal for deployment on mobile devices, laptops, and other edge environments due to its optimized architecture and smaller size.
  • Multimodal Understanding: Excellent for tasks requiring the interpretation of combined text, image, and audio inputs, such as image captioning, document parsing, and speech-to-text translation.
  • Reasoning and Problem Solving: Suitable for applications that benefit from structured thinking and logical deduction.
  • Code Generation and Agentic Workflows: Effective for generating, completing, and correcting code, as well as powering intelligent agents through native function calling.
  • Content Creation and Communication: Can be used for generating creative text formats, powering chatbots, text summarization, and extracting insights from visual and audio data.
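As an illustration of the application side of a function-calling workflow, here is a minimal Python sketch: declare a tool schema, parse a JSON tool call emitted by the model, and dispatch it. The JSON call format and the `get_weather` tool are hypothetical, not this model's documented interface.

```python
import json

# Hypothetical tool declaration in the common JSON-Schema style used by
# function-calling APIs; the exact schema Gemma expects is an assumption.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would query a weather API.
    return f"Sunny in {city}"

REGISTRY = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call from the model and execute the named tool."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

# Simulated model output requesting a tool call:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
print(result)  # Sunny in Oslo
```

The tool result would then be appended to the conversation as a tool message so the model can compose its final answer.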