ibyteohdear/gemma-3-12b-it-qat-q4_0-unquantized
The ibyteohdear/gemma-3-12b-it-qat-q4_0-unquantized model is a 12-billion-parameter, instruction-tuned variant of Google DeepMind's Gemma 3 family, built from the same research and technology as the Gemini models. This multimodal model handles text and image inputs (images normalized to 896x896 resolution and encoded to 256 tokens each) with a 128K-token input context window, generating text outputs. The checkpoint was produced with quantization-aware training (QAT), so quantizing it to Q4_0 preserves quality close to the bfloat16 original while significantly reducing memory requirements, making it suitable for text generation, image understanding, question answering, summarization, and reasoning tasks on resource-limited devices.
Gemma 3 12B Instruction-Tuned (QAT)
This model is a 12-billion-parameter, instruction-tuned variant from Google DeepMind's Gemma 3 family, leveraging the same research and technology as the Gemini models. It is notable for its quantization-aware training (QAT), which lets the quantized model preserve quality close to bfloat16 while drastically reducing its memory footprint, making it efficient to deploy.
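As a concrete illustration, the checkpoint can be loaded with 4-bit weights through transformers. Below is a minimal sketch, assuming a transformers release with Gemma 3 support (v4.50 or later) plus accelerate and bitsandbytes installed; note that bitsandbytes 4-bit (NF4) loading only approximates the Q4_0 format the QAT run targeted, since the exact quantization kernels differ between runtimes.

```python
# Minimal sketch: load the QAT checkpoint with 4-bit weights.
# Assumes transformers >= 4.50 (Gemma 3 support), accelerate, and bitsandbytes.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration, BitsAndBytesConfig

model_id = "ibyteohdear/gemma-3-12b-it-qat-q4_0-unquantized"

# NF4 approximates the Q4_0 target of the QAT run; kernels differ by runtime.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers across whatever devices are available
)
```

Back-of-the-envelope: 12B parameters at 4 bits is roughly 6 GB of weights (plus runtime overhead), versus about 24 GB in bfloat16, which is what makes laptop-class deployment realistic.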
Key Capabilities
- Multimodal: Processes both text and image inputs (images normalized to 896x896 resolution, encoded to 256 tokens each); see the inference sketch after this list.
- Extensive Context: Supports a total input context of 128K tokens.
- Multilingual Support: Trained on data spanning more than 140 languages.
- Diverse Task Performance: Excels in text generation, image understanding, question answering, summarization, and reasoning.
- Optimized for Efficiency: QAT enables deployment in environments with limited resources like laptops or desktops.
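The multimodal flow referenced above can be sketched as follows, continuing from the loading example and assuming the same transformers setup; the image URL and prompt are placeholders.

```python
# Minimal sketch: image + text chat via the processor's chat template.
# Continues from the loading sketch; URL and prompt are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Gemma 3 normalizes each image to 896x896 and represents it as 256 tokens,
# so one image consumes a small, fixed share of the context window.
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],  # drop the echoed prompt
    skip_special_tokens=True,
)
print(answer)
```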
Training and Performance
The 12B model was trained on 12 trillion tokens, encompassing web documents, code, mathematics, and images. It demonstrates strong performance across various benchmarks, including:
- Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
- STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
- Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
- Multimodal: Achieves 71.2 on VQAv2 and 50.3 on MMMU.
Good for
- Applications requiring efficient multimodal processing (text and image).
- Deploying powerful language models on resource-constrained devices due to QAT optimization (see the local-inference sketch after this list).
- Tasks involving multilingual text generation and understanding.
- Developing solutions for question answering, summarization, and complex reasoning.
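For fully local inference, one common route is converting the checkpoint to GGUF, quantizing it to Q4_0 with llama.cpp's tooling, and serving it through llama-cpp-python. A minimal text-only sketch, assuming such a Q4_0 GGUF export already exists at the hypothetical path below:

```python
# Minimal sketch: CPU-friendly local inference via llama-cpp-python.
# Assumes a Q4_0 GGUF export of this checkpoint; the path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-3-12b-it-q4_0.gguf",  # hypothetical local export
    n_ctx=8192,  # context allocated for this session, not the model's maximum
)

result = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Summarize quantization-aware training in two sentences."}
    ],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```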