google/gemma-3-12b-it

Warm · Public · Vision · 12B · FP8 · 32768 · License: gemma · Hugging Face · Gated
Overview

Google DeepMind's Gemma 3 12B-IT is a 12-billion-parameter, instruction-tuned multimodal model from the Gemma family, built on research and technology behind the Gemini models. It accepts text and image inputs, produces text outputs, and supports over 140 languages. A 128K-token context window enables long-document and extended multi-turn interactions.

Key Capabilities

  • Multimodal Understanding: Processes text and images (normalized to 896x896 resolution, encoded to 256 tokens each) to generate text responses; see the inference sketch after this list.
  • Extensive Context Window: Utilizes a 128K token context for the 12B variant, allowing for detailed and lengthy inputs.
  • Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
  • Diverse Task Performance: Well-suited for question answering, summarization, reasoning, and content creation.
  • Optimized for Deployment: Its relatively compact size allows deployment in environments with limited resources, such as laptops or modest cloud instances.
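
The multimodal path can be exercised with a short script. The following is a minimal sketch using the Hugging Face transformers image-text-to-text pipeline; the image URL and prompt are placeholders, and it assumes a transformers release with Gemma 3 support plus acceptance of the gated license on Hugging Face.

```python
# Minimal sketch: multimodal inference with Hugging Face transformers.
# Assumes Gemma 3 support in transformers, sufficient GPU memory, and
# that the gated license has been accepted on Hugging Face.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Chat-style message combining an image (placeholder URL) with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Summarize the main trend shown in this chart."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```

Each image is resized to 896x896 and contributes 256 tokens to the prompt, so several images fit comfortably within the context window.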

Performance Highlights

Evaluations show strong performance across various benchmarks:

  • Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
  • STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
  • Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
  • Multimodal: Scores 111 on COCO Captions (COCOcap) and 82.3 on DocVQA.

Intended Usage

This model is designed for a wide range of applications, including:

  • Content Creation: Generating creative text formats, marketing copy, and email drafts.
  • Conversational AI: Powering chatbots and virtual assistants (a chat-completion sketch follows this list).
  • Information Extraction: Summarizing text and extracting insights from visual data.
  • Research & Education: Serving as a foundation for VLM/NLP research and language learning tools.