Mimma-3-12b: A Multimodal Gemma 3 Model
Mimma-3-12b, developed by pankajmathur, is a multimodal vision-language model built on the Gemma 3 architecture. It processes both text and image inputs and generates text outputs. The model belongs to the Gemma 3 family of lightweight, open models from Google, built from the same research and technology as the Gemini models.
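A minimal usage sketch, assuming the model is published under the repo id pankajmathur/Mimma-3-12b and exposes the standard Gemma 3 interface in recent versions of transformers (the image URL is a placeholder):

```python
# Minimal sketch: load the model and run one image+text prompt.
# Assumptions: repo id "pankajmathur/Mimma-3-12b" and the standard
# Gemma 3 classes in transformers >= 4.50.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "pankajmathur/Mimma-3-12b"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# The processor resizes the image and builds the Gemma 3 chat prompt.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```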
Key Capabilities
- Multimodal Input: Accepts text strings and images; each image is normalized to 896x896 resolution and encoded to 256 tokens.
- Large Context Window: A 128K-token context window supports long documents and multi-image prompts (see the budgeting sketch after this list).
- Multilingual Support: Trained on data spanning more than 140 languages.
- Diverse Task Performance: Well suited to text generation and image understanding tasks such as question answering, summarization, and reasoning.
- Optimized for Resources: Its relatively small size allows deployment in resource-limited environments such as laptops and desktops.
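Because each image costs a fixed number of tokens, context budgeting reduces to simple arithmetic. A quick illustration (the constants restate the figures above; the helper function is ours, not part of any API):

```python
# Context budgeting with the figures cited above: every image is resized
# to 896x896 and costs a flat 256 tokens, inside a 128K-token window.
CONTEXT_WINDOW = 128_000
TOKENS_PER_IMAGE = 256

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after reserving space for image embeddings."""
    return CONTEXT_WINDOW - num_images * TOKENS_PER_IMAGE

print(remaining_text_budget(1))   # 127744 tokens of text alongside one image
print(remaining_text_budget(50))  # 115200 tokens even with fifty images
```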
Training and Performance
The underlying Gemma 3 12B model was trained on 12 trillion tokens spanning web documents, code, mathematics, and images. It performs strongly across benchmarks for reasoning (e.g., 84.2 on HellaSwag), STEM and code (e.g., 74.5 on MMLU, 71.0 on GSM8K), and multimodal understanding (e.g., 111 on COCOcap, 82.3 on DocVQA).
Good for
- Content Creation: Generating creative text formats, marketing copy, or email drafts.
- Conversational AI: Powering chatbots and virtual assistants.
- Text Summarization: Creating concise summaries of documents or research papers (see the sketch after this list).
- Image Data Extraction: Extracting and interpreting visual data, then summarizing it for use in text communications.
- Research and Education: Serving as a foundation for VLM and NLP research, language learning tools, and knowledge exploration.
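For the text-only use cases such as summarization, the high-level pipeline API accepts plain chat messages with no image entry. A hedged sketch (repo id assumed as above; the input text is a stand-in):

```python
# Hypothetical summarization call via the image-text-to-text pipeline.
# The repo id is assumed from the model name; the input text is a stand-in.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="pankajmathur/Mimma-3-12b",  # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

report = (
    "Quarterly revenue rose 12% on strong cloud demand, while hardware "
    "sales declined 4%. Operating margin improved to 28%."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Summarize in one sentence:\n{report}"},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=64)
# The pipeline returns the full chat; the last turn is the model's reply.
print(output[0]["generated_text"][-1]["content"])
```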