The Gemma 3 12B IT model by Google DeepMind is a 12 billion parameter instruction-tuned multimodal language model with a 128K token context window. It handles both text and image inputs to generate text outputs, supporting over 140 languages. This model is optimized for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning, and is designed for deployment in resource-limited environments.
Loading preview...
Overview
Gemma 3 is a family of lightweight, state-of-the-art open models developed by Google DeepMind, built using the same research and technology as the Gemini models. The unsloth/gemma-3-12b-it variant is a 12 billion parameter instruction-tuned model featuring a large 128K context window and multilingual support across over 140 languages. It is multimodal, capable of processing both text and image inputs (normalized to 896x896 resolution) to generate text outputs.
Key Capabilities
- Multimodal Understanding: Processes text and images to generate relevant text, suitable for tasks like image analysis and visual data extraction.
- Extensive Context Window: Utilizes a 128K token context window for comprehensive understanding and generation.
- Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
- Versatile Text Generation: Excels in tasks such as question answering, summarization, reasoning, creative text generation, and conversational AI.
- Resource-Efficient Deployment: Its relatively small size makes it suitable for deployment on devices with limited resources, including laptops and cloud infrastructure.
Performance Highlights
The Gemma 3 12B model demonstrates strong performance across various benchmarks:
- Reasoning: Achieved 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
- STEM & Code: Scored 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
- Multimodal: Achieved 111 on COCOcap and 71.2 on VQAv2.
Good for
- Developers and researchers experimenting with multimodal AI.
- Applications requiring text generation from diverse inputs, including images.
- Deployment in environments where computational resources are a constraint.
- Building multilingual applications and tools.
Top 3 parameter combinations used by Featherless users for this model. Click a tab to see each config.