Overview
Google DeepMind's Gemma 3 12B-IT is a 12-billion-parameter, instruction-tuned multimodal model from the Gemma family, derived from the research and technology behind Gemini. It accepts text and image inputs, produces text outputs, and supports over 140 languages. A 128K-token context window lets long documents and extended multi-turn interactions fit in a single prompt.
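As a minimal sketch of what text-and-image inference might look like, the snippet below uses the Hugging Face transformers `image-text-to-text` pipeline with the model id `google/gemma-3-12b-it`. The image URL and prompt are placeholders, and the exact pipeline behavior (and output structure) depends on your transformers version.

```python
# Sketch: text + image -> text with the transformers image-text-to-text pipeline.
# Assumes a recent transformers release with Gemma 3 support; not an official recipe.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
)

# One chat-style user turn containing an image reference and a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/invoice.png"},  # hypothetical image URL
            {"type": "text", "text": "Summarize the key figures in this document."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
# With chat-style input, the pipeline typically returns the conversation with the
# assistant turn appended; structure may vary across transformers versions.
print(out[0]["generated_text"][-1]["content"])
```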
Key Capabilities
- Multimodal Understanding: Processes text and images (each normalized to 896x896 resolution and encoded to 256 tokens) to generate text responses.
- Extensive Context Window: Utilizes a 128K token context for the 12B variant, allowing for detailed and lengthy inputs.
- Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
- Diverse Task Performance: Well-suited for question answering, summarization, reasoning, and content creation.
- Optimized for Deployment: Its relatively compact size enables deployment in resource-limited environments, from laptops and workstations to modest cloud instances (a quantized-loading sketch follows this list).
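For the memory-constrained settings mentioned above, one common approach is 4-bit quantization at load time. The sketch below assumes the bitsandbytes integration in transformers and a CUDA-capable device; the model id and configuration values are illustrative, not an official recipe.

```python
# Sketch: load the model in 4-bit to reduce memory footprint on constrained hardware.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

model_id = "google/gemma-3-12b-it"

# NF4 quantization with bfloat16 compute; exact savings depend on hardware and versions.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```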
Performance Highlights
Evaluations show strong performance across various benchmarks:
- Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
- STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
- Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
- Multimodal: Scores 111 on COCOcap and 82.3 on DocVQA.
Intended Usage
This model is designed for a wide range of applications, including:
- Content Creation: Generating creative text formats, marketing copy, and email drafts.
- Conversational AI: Powering chatbots and virtual assistants (see the chat sketch after this list).
- Information Extraction: Summarizing text and extracting insights from visual data.
- Research & Education: Serving as a foundation for VLM/NLP research and language learning tools.
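For the conversational use cases above, a minimal text-only chat turn might look like the following. It continues from the quantized-loading sketch earlier (reusing its `torch` import and the `model` and `processor` objects); the prompt and generation settings are illustrative.

```python
# Sketch: a single text-only chat turn using the processor's chat template.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Draft a short welcome email for new users."}]},
]

# Apply Gemma 3's chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=200)

# Strip the prompt tokens and decode only the newly generated reply.
reply = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)
```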