Model Overview

This model is part of the Gemma 3 family developed by Google DeepMind and is built on the same research and technology as the Gemini models. It is multimodal: it accepts text and image inputs and generates text outputs. Open weights are available for both the pre-trained and instruction-tuned variants.

Key Capabilities

Multimodal Input: Accepts text strings and images (normalized to 896x896 resolution, encoded to 256 tokens each).
Large Context Window: Supports a total input context of 128K tokens.
Multilingual Support: Trained on data including content in over 140 languages.
Diverse Task Performance: Suited to text generation and image understanding tasks, including question answering, summarization, and reasoning.
Resource-Efficient Deployment: At 12B parameters, the model is small enough to deploy in resource-limited environments such as laptops, desktops, or private cloud infrastructure.
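
The figures in the list above (256 tokens per image, a 128K-token context, 12B parameters) lend themselves to some quick budgeting arithmetic. The sketch below is illustrative only and rests on several assumptions: "128K" is read as 128 * 1024 tokens, the parameter count is taken as a nominal 12 billion, and the memory estimate covers weight storage only (no KV cache, activations, or runtime overhead). The function names are hypothetical and not part of any Gemma tooling.

```python
# Figures taken from the capability list; see the assumptions noted above.
CONTEXT_TOKENS = 128 * 1024      # assumption: "128K" means 128 * 1024 tokens
TOKENS_PER_IMAGE = 256           # each image is encoded to 256 tokens
PARAMS = 12_000_000_000          # assumption: nominal 12B parameter count


def text_tokens_left(num_images: int, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Tokens remaining for text after reserving 256 tokens per image."""
    used = num_images * TOKENS_PER_IMAGE
    if used > context_tokens:
        raise ValueError("images alone exceed the context window")
    return context_tokens - used


def weight_memory_gb(bytes_per_param: float, params: int = PARAMS) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print("text tokens left with 10 images:", text_tokens_left(10))
    for name, width in [("float32", 4), ("bfloat16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name}: ~{weight_memory_gb(width):.0f} GB for weights")
```

Under these assumptions, ten images consume 2,560 tokens of the context, and 16-bit weights alone occupy roughly 24 GB, which is why lower-precision formats are usually the practical route on laptop-class hardware.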

Training Details

The 12B-parameter variant was trained on 12 trillion tokens drawn from web documents, code, mathematical text, and a wide range of images. Data preprocessing included CSAM filtering, sensitive-data filtering, and filtering based on content quality and safety. Training ran on Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e) using JAX and ML Pathways.

Benchmark Performance

The pre-trained (PT) Gemma 3 12B model reports the following benchmark scores:

Reasoning & Factuality: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
STEM & Code: Scores 74.5 on MMLU (5-shot) and 71.0 on GSM8K (8-shot).
Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
Multimodal: Attains 111 on COCOcap and 82.3 on DocVQA (val).

Intended Usage

This model is designed for a broad range of applications, including:

Content Creation: Generating creative text formats, marketing copy, and email drafts.
Conversational AI: Powering chatbots and virtual assistants.
Text Summarization: Creating concise summaries of documents.
Image Data Extraction: Interpreting and summarizing visual data for text communications.
Research & Education: Serving as a foundation for VLM/NLP research and language learning tools.
