Gemma 3 12B-IT is a 12 billion parameter instruction-tuned multimodal model developed by Google DeepMind, built from the same research and technology as the Gemini models. It accepts text and image inputs, generates text outputs, and supports over 140 languages with a 128K token context window. It is well suited to text generation and image understanding tasks, including question answering, summarization, and reasoning, and its relatively compact size makes it suitable for deployment in resource-limited environments.
Overview
Google DeepMind's Gemma 3 12B-IT is a 12 billion parameter instruction-tuned multimodal model, part of the Gemma family derived from Gemini research. It processes both text and image inputs, producing text outputs, and supports over 140 languages. Its 128K token context window lets it take long inputs, such as lengthy documents or extended conversations, in a single prompt.
Key Capabilities
- Multimodal Understanding: Processes text and images (normalized to 896x896 resolution, encoded to 256 tokens each) to generate text responses.
- Extensive Context Window: Utilizes a 128K token context for the 12B variant, allowing for detailed and lengthy inputs.
- Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
- Diverse Task Performance: Well-suited for question answering, summarization, reasoning, and content creation.
- Optimized for Deployment: Its relatively compact size facilitates deployment in environments with limited resources, such as laptops or cloud infrastructure.
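The image and context figures above imply a simple token budget: every image costs a fixed 256 tokens, and that cost comes out of the same 128K window as the text. The sketch below works through that arithmetic. It is an illustration only: the function names are hypothetical, and it assumes that "128K" means 128 * 1024 tokens and that images and text share one input window, neither of which is stated on this page.

```python
# Figure taken from the capabilities list: each image is encoded to 256 tokens.
TOKENS_PER_IMAGE = 256

# Assumption: "128K" is read as 128 * 1024 tokens; the exact limit is not given here.
CONTEXT_WINDOW = 128 * 1024


def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after num_images images are placed in the context."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    remaining = CONTEXT_WINDOW - num_images * TOKENS_PER_IMAGE
    if remaining < 0:
        raise ValueError("images alone exceed the assumed context window")
    return remaining


def max_images(text_tokens: int) -> int:
    """Largest number of images that fits alongside text_tokens of text."""
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    free = max(CONTEXT_WINDOW - text_tokens, 0)
    return free // TOKENS_PER_IMAGE


if __name__ == "__main__":
    print(remaining_text_budget(4))  # text budget with four images attached
    print(max_images(100_000))       # images that fit next to a 100,000-token prompt
```

Under these assumptions, the image cost is small relative to the window, so the text length is usually the binding constraint. Actual token counts for text depend on the tokenizer and should be measured rather than estimated.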
Performance Highlights
Evaluations show strong performance across various benchmarks:
- Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
- STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
- Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
- Multimodal: Demonstrates capabilities on benchmarks like COCOcap (111) and DocVQA (82.3).
Intended Usage
This model is designed for a wide range of applications, including:
- Content Creation: Generating creative text formats, marketing copy, and email drafts.
- Conversational AI: Powering chatbots and virtual assistants.
- Information Extraction: Summarizing text and extracting insights from visual data.
- Research & Education: Serving as a foundation for VLM/NLP research and language learning tools.