Sigtunnel/gemma-encoder

Vision · Concurrency Cost: 1 · Model Size: 12B · Quant: FP8 · Ctx Length: 32k · Published: Mar 30, 2026 · License: gemma · Architecture: Transformer

Gemma 3 is a family of lightweight, state-of-the-art open models from Google DeepMind, built from the same research and technology as Gemini. This 12 billion parameter instruction-tuned variant is multimodal: it accepts text and image input within a 128K-token context window and generates text output. Optimized for a wide range of text generation and image understanding tasks, it supports over 140 languages and is designed for deployment in resource-limited environments.
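To make the text-plus-image interface concrete, here is a minimal inference sketch. It assumes the Hugging Face `transformers` library (v4.50 or later) with its `image-text-to-text` pipeline and the public `google/gemma-3-12b-it` checkpoint; the image URL and prompt are placeholders, not an official snippet for this repository.

```python
import torch
from transformers import pipeline

# Assumed setup: transformers >= 4.50 and the public google/gemma-3-12b-it
# checkpoint; swap in your own model id and hardware mapping as needed.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder image URL; local file paths also work.
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Summarize the key figures in this chart."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
# The pipeline returns the full chat; the last turn holds the model's reply.
print(out[0]["generated_text"][-1]["content"])
```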

Gemma 3: Multimodal, Multilingual, and Efficient

Google DeepMind's Gemma 3 is a family of open models; this repository features the 12 billion parameter instruction-tuned version. Built on the same research and technology as the Gemini models, Gemma 3 is multimodal, accepting both text and image inputs and generating text outputs. A key differentiator for this 12B model is its use of Quantization Aware Training (QAT), which lets it retain quality close to the bfloat16 baseline while significantly reducing memory requirements when quantized to Q4_0.
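Since QAT checkpoints are usually distributed as Q4_0 GGUF files, a common way to exploit the reduced memory footprint is local inference through llama.cpp. The sketch below uses `llama-cpp-python`; the file name, context size, and prompt are illustrative assumptions rather than anything fixed by this repository.

```python
from llama_cpp import Llama

# Assumed: a locally downloaded Q4_0 GGUF build of the 12B IT model.
llm = Llama(
    model_path="gemma-3-12b-it-q4_0.gguf",  # placeholder path
    n_ctx=32768,       # context to allocate; raise it if memory allows
    n_gpu_layers=-1,   # offload all layers to the GPU when one is available
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantization-aware training in two sentences."}
    ],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```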

Key Capabilities & Features

  • Multimodal Input: Processes both text and images (normalized to 896x896 resolution, encoded to 256 tokens each).
  • Large Context Window: Supports a total input context of 128K tokens (see the token-budget sketch after this list).
  • Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
  • Optimized for Efficiency: QAT enables near bfloat16 quality with reduced memory footprint, suitable for deployment on laptops, desktops, or private cloud infrastructure.
  • Broad Task Performance: Excels in text generation, image understanding, question answering, summarization, and reasoning.
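To make the context arithmetic above concrete, the sketch below budgets input tokens from the stated figures: a flat 256 tokens per image within a 128K-token input context (treated here as 128,000 for simplicity). The helper name and the reserved output allowance are illustrative assumptions.

```python
CONTEXT_TOKENS = 128_000   # stated input context, approximated as 128,000
TOKENS_PER_IMAGE = 256     # each 896x896-normalized image costs a flat 256 tokens

def remaining_text_budget(num_images: int, reserved_output: int = 2_048) -> int:
    """Tokens left for text input after images and a reserved generation budget."""
    used = num_images * TOKENS_PER_IMAGE + reserved_output
    return max(CONTEXT_TOKENS - used, 0)

# Example: a prompt with ten page scans still leaves ~123K tokens for text.
print(remaining_text_budget(num_images=10))  # -> 123392
```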

Suitable Use Cases

  • Content Creation: Generating creative text, marketing copy, or email drafts.
  • Conversational AI: Powering chatbots and virtual assistants.
  • Image Data Extraction: Extracting, interpreting, and summarizing visual data for use in text communications.
  • Research & Education: Serving as a foundation for VLM and NLP research, or developing language learning tools.

Performance Highlights (12B PT model)

  • Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
  • STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
  • Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
  • Multimodal: Demonstrates 82.3 on DocVQA and 75.2 on AI2D.