pankajmathur/Mimma-3-12b

VISIONConcurrency Cost:1Model Size:12BQuant:FP8Ctx Length:32kLicense:gemmaArchitecture:Transformer0.0K Cold

Mimma-3-12b by pankajmathur is a multimodal vision-language model based on the Gemma 3 architecture, designed to handle both text and image inputs and generate text outputs. This 12 billion parameter model features a large 128K context window and multilingual support for over 140 languages. It excels at a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning, making it suitable for resource-limited environments.

Loading preview...

Mimma-3-12b: A Multimodal Gemma 3 Model

Mimma-3-12b, developed by pankajmathur, is a multimodal vision-language model built upon the Gemma 3 architecture. This model is designed to process both text and image inputs, generating text outputs. It is part of the Gemma 3 family, which are lightweight, open models from Google, leveraging the same research and technology as the Gemini models.

Key Capabilities

  • Multimodal Input: Accepts text strings and images (normalized to 896x896 resolution, encoded to 256 tokens each).
  • Large Context Window: Features a substantial 128K token context window, enabling processing of extensive inputs.
  • Multilingual Support: Trained on data including content in over 140 languages.
  • Diverse Task Performance: Well-suited for text generation and image understanding tasks such as question answering, summarization, and reasoning.
  • Optimized for Resources: Its relatively smaller size makes it deployable in environments with limited resources, like laptops or desktops.

Training and Performance

The underlying Gemma 3 12B model was trained on 12 trillion tokens, encompassing web documents, code, mathematics, and images. It demonstrates strong performance across various benchmarks, including reasoning (e.g., 84.2 on HellaSwag), STEM and code (e.g., 74.5 on MMLU, 71.0 on GSM8K), and multimodal tasks (e.g., 111 on COCOcap, 82.3 on DocVQA).

Good for

  • Content Creation: Generating creative text formats, marketing copy, or email drafts.
  • Conversational AI: Powering chatbots and virtual assistants.
  • Text Summarization: Creating concise summaries of documents or research papers.
  • Image Data Extraction: Interpreting and summarizing visual data for text communications.
  • Research and Education: Serving as a foundation for VLM and NLP research, language learning tools, and knowledge exploration.