Sigtunnel/gemma-encoder

Vision · Concurrency Cost: 1 · Model Size: 12B · Quant: FP8 · Ctx Length: 32k · Published: Mar 30, 2026 · License: gemma · Architecture: Transformer

Gemma 3 is a family of lightweight, state-of-the-art open models from Google DeepMind, built from the same research and technology as Gemini. This 12 billion parameter instruction-tuned variant is multimodal: it accepts text and image input within a 128K-token context window and generates text output. Optimized for a wide range of text generation and image understanding tasks, it supports over 140 languages and is designed for deployment in resource-limited environments.
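To make the text-plus-image interface concrete, here is a minimal inference sketch. It assumes the Hugging Face `transformers` library (v4.50 or later) with its `image-text-to-text` pipeline and the public `google/gemma-3-12b-it` checkpoint; the image URL and prompt are placeholders, not an official snippet for this repository.

```python
import torch
from transformers import pipeline

# Assumed setup: transformers >= 4.50 and the public google/gemma-3-12b-it
# checkpoint; swap in your own model id and hardware mapping as needed.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder image URL; local file paths also work.
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Summarize the key figures in this chart."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
# The pipeline returns the full chat; the last turn holds the model's reply.
print(out[0]["generated_text"][-1]["content"])
```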

Gemma 3: Multimodal, Multilingual, and Efficient

Google DeepMind's Gemma 3 is a family of open models; this repository features the 12 billion parameter instruction-tuned version. Built on the same research and technology as the Gemini models, Gemma 3 is multimodal, accepting both text and image inputs and generating text outputs. A key differentiator for this 12B model is its use of Quantization Aware Training (QAT), which lets it retain quality close to the bfloat16 baseline while significantly reducing memory requirements when quantized to Q4_0.
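Since QAT checkpoints are usually distributed as Q4_0 GGUF files, a common way to exploit the reduced memory footprint is local inference through llama.cpp. The sketch below uses `llama-cpp-python`; the file name, context size, and prompt are illustrative assumptions rather than anything fixed by this repository.

```python
from llama_cpp import Llama

# Assumed: a locally downloaded Q4_0 GGUF build of the 12B IT model.
llm = Llama(
    model_path="gemma-3-12b-it-q4_0.gguf",  # placeholder path
    n_ctx=32768,       # context to allocate; raise it if memory allows
    n_gpu_layers=-1,   # offload all layers to the GPU when one is available
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantization-aware training in two sentences."}
    ],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```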

Key Capabilities & Features

  • Multimodal Input: Processes both text and images (normalized to 896x896 resolution, encoded to 256 tokens each).
  • Large Context Window: Supports a total input context of 128K tokens (see the token-budget sketch after this list).
  • Multilingual Support: Trained on data in over 140 languages, enhancing its global applicability.
  • Optimized for Efficiency: QAT enables near bfloat16 quality with reduced memory footprint, suitable for deployment on laptops, desktops, or private cloud infrastructure.
  • Broad Task Performance: Excels in text generation, image understanding, question answering, summarization, and reasoning.
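To make the context arithmetic above concrete, the sketch below budgets input tokens from the stated figures: a flat 256 tokens per image within a 128K-token input context (treated here as 128,000 for simplicity). The helper name and the reserved output allowance are illustrative assumptions.

```python
CONTEXT_TOKENS = 128_000   # stated input context, approximated as 128,000
TOKENS_PER_IMAGE = 256     # each 896x896-normalized image costs a flat 256 tokens

def remaining_text_budget(num_images: int, reserved_output: int = 2_048) -> int:
    """Tokens left for text input after images and a reserved generation budget."""
    used = num_images * TOKENS_PER_IMAGE + reserved_output
    return max(CONTEXT_TOKENS - used, 0)

# Example: a prompt with ten page scans still leaves ~123K tokens for text.
print(remaining_text_budget(num_images=10))  # -> 123392
```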

Suitable Use Cases

  • Content Creation: Generating creative text, marketing copy, or email drafts.
  • Conversational AI: Powering chatbots and virtual assistants.
  • Image Data Extraction: Extracting, interpreting, and summarizing visual data for use in text communications.
  • Research & Education: Serving as a foundation for VLM and NLP research, or developing language learning tools.

Performance Highlights (12B PT model)

  • Reasoning: Achieves 84.2 on HellaSwag (10-shot) and 72.6 on BIG-Bench Hard (few-shot).
  • STEM & Code: Scores 74.5 on MMLU (5-shot) and 45.7 on HumanEval (0-shot).
  • Multilingual: Reaches 64.3 on MGSM and 69.4 on Global-MMLU-Lite.
  • Multimodal: Demonstrates 82.3 on DocVQA and 75.2 on AI2D.