unsloth/gemma-3-4b-it-qat-int4
The unsloth/gemma-3-4b-it-qat-int4 model is a 4.3-billion-parameter instruction-tuned variant of Google DeepMind's multimodal Gemma 3 family, optimized for efficient deployment. This version uses Quantization Aware Training (QAT) to preserve quality under int4 quantization while significantly reducing memory requirements. It handles text generation and image understanding tasks, supports a 128K-token context window and over 140 languages, and is well suited to resource-constrained environments.
Overview
This model is an instruction-tuned 4.3 billion parameter variant from Google DeepMind's Gemma 3 family, designed for efficient deployment through Quantization Aware Training (QAT). It maintains quality comparable to bfloat16 while drastically reducing memory footprint when quantized to int4. Gemma 3 models are multimodal, processing both text and image inputs to generate text outputs, and feature open weights.
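Below is a minimal loading sketch using the Hugging Face transformers library. The class and argument choices (Gemma3ForConditionalGeneration, AutoProcessor, device_map, compute dtype) are assumptions based on the standard Gemma 3 integration in recent transformers releases; consult the model repository for the exact recommended usage.

```python
# Minimal sketch: loading the int4 QAT checkpoint with transformers (assumed API).
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "unsloth/gemma-3-4b-it-qat-int4"

# Processor bundles the tokenizer and the image preprocessor (896x896 normalization).
processor = AutoProcessor.from_pretrained(model_id)

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # compute dtype; the stored weights are int4-quantized
    device_map="auto",           # place layers automatically on available GPU/CPU
)
```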
Key Capabilities
- Multimodal Input: Handles text strings and images (normalized to 896x896 resolution and encoded to 256 tokens each); see the usage sketch after this list.
- Large Context Window: Supports a total input context of 128K tokens and an output context of 8192 tokens.
- Multilingual Support: Trained on data including content in over 140 languages.
- Efficient Deployment: QAT enables near-bfloat16 quality with int4 quantization, making the model deployable on resource-limited hardware such as laptops as well as standard cloud infrastructure.
- Broad Task Suitability: Well-suited for text generation and image understanding tasks, including question answering, summarization, and reasoning.
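The sketch below illustrates a combined image-plus-text prompt, reusing the model and processor from the loading example above. The chat-template message structure and generation settings are assumptions drawn from typical Gemma 3 examples, not instructions from this model card.

```python
# Minimal multimodal sketch (assumed message format and processor behavior).
from PIL import Image

image = Image.open("example.jpg")  # hypothetical local image file

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# The processor resizes the image to 896x896 and encodes it as 256 image tokens.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Output window is 8192 tokens per the capabilities above; keep generation short here.
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```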
Training & Evaluation
The 4B model was trained on 4 trillion tokens spanning web documents, code, mathematics, and images. Training ran on Google's TPU hardware (TPUv4p, TPUv5p, TPUv5e) using the JAX and ML Pathways software stack. Evaluation benchmarks cover reasoning, factuality, STEM, code, multilingual capability, and multimodal understanding, where the model performs competitively for its size.