Gemma 3 12B Instruction-Tuned Model

This model is the 12-billion-parameter instruction-tuned variant in Google DeepMind's Gemma 3 family, built from the same research and technology as the Gemini models. It is an open-weights multimodal model that accepts text and image inputs and produces text output. It has a 128K-token context window and supports over 140 languages.

Key Capabilities

Multimodal Understanding: Processes both text and images (normalized to 896x896 resolution, encoded to 256 tokens each) to generate relevant text.
Extended Context: Supports a total input context of 128K tokens, enabling processing of longer and more complex inputs.
Multilingual Support: Trained on web documents in over 140 languages, enhancing its ability to understand and generate text across diverse linguistic contexts.
Versatile Text Generation: Handles tasks such as question answering, summarization, and reasoning.
Resource-Efficient Deployment: Its relatively compact size allows for deployment on devices with limited resources, including laptops, desktops, and private cloud infrastructure.
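
The figures above lend themselves to some quick arithmetic. The following is a minimal sketch, not part of the model card: it estimates how much of the context window remains for text after attaching images, and roughly how much memory the weights alone would occupy. It assumes "128K" means 128 * 1024 tokens, treats the model as exactly 12e9 parameters, and ignores activations, the KV cache, and runtime overhead. The function names are hypothetical.

```python
# Assumption: "128K" is read as 128 * 1024 = 131,072 tokens.
CONTEXT_TOKENS = 128 * 1024
# From the capability list: each image is encoded to 256 tokens.
TOKENS_PER_IMAGE = 256


def remaining_text_tokens(num_images: int, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Tokens left for text after reserving 256 tokens per attached image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    used = num_images * TOKENS_PER_IMAGE
    if used > context_tokens:
        raise ValueError("images alone exceed the context window")
    return context_tokens - used


def weight_memory_gib(num_params: float = 12e9, bytes_per_param: float = 2.0) -> float:
    """Rough size of the weights alone, in GiB (assumes exactly num_params parameters)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    print(remaining_text_tokens(10))          # text budget with 10 images attached
    print(round(weight_memory_gib(), 2))      # 16-bit weights (2 bytes per parameter)
    print(round(weight_memory_gib(bytes_per_param=0.5), 2))  # 4-bit quantized weights
```

Under these assumptions, images consume only a small share of the window, and the precision chosen for the weights largely determines whether the model fits on a laptop or desktop GPU.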

Training and Performance

The 12B model was trained on 12 trillion tokens of web documents, code, mathematics, and images. It performs strongly on benchmarks covering reasoning (e.g., 72.6 on BIG-Bench Hard), STEM and code (e.g., 74.5 on MMLU, 71.0 on GSM8K), and multimodal tasks (e.g., 71.2 on VQAv2). The training data was filtered for CSAM and sensitive data, and the model was evaluated for child safety, content safety, and representational harms, showing significant improvements over previous Gemma models in these areas.
