pankajmathur/Mimma-3-4b-v3

Vision · Concurrency cost: 1 · Model size: 4.3B · Quantization: BF16 · Context length: 32k · License: Gemma · Architecture: Transformer

Mimma-3-4b-v3 is a multimodal vision-language model developed by pankajmathur, based on the Gemma 3 architecture. This model integrates text and image understanding, generating text output from both modalities. It features a large 128K context window and multilingual support across over 140 languages, making it suitable for diverse text generation and image analysis tasks like question answering, summarization, and reasoning.


Mimma-3-4b-v3: A Multimodal Gemma 3 Variant

Mimma-3-4b-v3 is a multimodal vision-language model (VLM) developed by pankajmathur, building upon Google's Gemma 3 architecture. The model processes both text and image inputs and generates text-based responses. It inherits the strengths of the Gemma 3 family, which is known for its lightweight design and open weights.

Key Capabilities

  • Multimodal Input: Accepts both text strings (questions, prompts) and images (normalized to 896x896 resolution, encoded to 256 tokens each).
  • Text Generation: Capable of generating diverse text outputs, including answers, summaries, and creative content.
  • Large Context Window: Features a substantial 128K token context window, enabling processing of extensive inputs.
  • Multilingual Support: Supports over 140 languages, enhancing its applicability in global contexts.
  • Instruction-Tuned: Designed to follow instructions effectively, as demonstrated by its use with chat templates for conversational tasks.
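Since the model is instruction-tuned and used with chat templates, a prompt that mixes an image and a text question is typically expressed as a structured message list. The sketch below shows one plausible way to build such a prompt and run it with Hugging Face `transformers`; the class names and message schema are assumptions based on how Gemma 3 checkpoints are commonly served, not confirmed details of this model.

```python
# Sketch: building a multimodal chat prompt for Mimma-3-4b-v3.
# Assumes the Gemma 3 chat-message convention (role + list of typed content parts).

def build_messages(image_url: str, question: str) -> list:
    """Build a chat message combining one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("https://example.com/photo.png", "What is in this image?")

# Inference (illustrative only; downloads the ~4B checkpoint, and the exact
# Auto classes depend on your installed transformers version):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("pankajmathur/Mimma-3-4b-v3")
# model = AutoModelForImageTextToText.from_pretrained("pankajmathur/Mimma-3-4b-v3")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# )
# output = model.generate(**inputs, max_new_tokens=128)
# print(processor.decode(output[0], skip_special_tokens=True))
```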

Good For

  • Image Understanding: Analyzing image content and extracting information, such as identifying objects or describing scenes.
  • Text Generation: Creating various forms of text, from summaries and answers to more creative formats.
  • Conversational AI: Powering chatbots and interactive applications that require multimodal input.
  • Resource-Limited Environments: Its relatively small size (4B parameters) makes it suitable for deployment on hardware with limited computational resources, such as laptops or self-managed cloud instances.
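When planning prompts on constrained hardware, it can help to budget the context window explicitly. A rough estimate, taking the 128K-token window and the 256-tokens-per-image figure stated above at face value:

```python
# Rough context-budget arithmetic for mixed image + text prompts,
# using the figures stated in this card (assumed, not measured):
# a 128K-token window and 256 tokens per encoded 896x896 image.
CONTEXT_WINDOW = 128 * 1024   # 131072 tokens
TOKENS_PER_IMAGE = 256

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after reserving space for num_images images."""
    used = num_images * TOKENS_PER_IMAGE
    if used > CONTEXT_WINDOW:
        raise ValueError("images alone exceed the context window")
    return CONTEXT_WINDOW - used

print(remaining_text_budget(4))            # 4 images leave 130048 tokens for text
print(CONTEXT_WINDOW // TOKENS_PER_IMAGE)  # at most 512 image-only inputs fit
```

In practice the usable budget is smaller, since the chat template and generated output also consume tokens.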