CALISTA-INDUSTRY/gemma_3_1B_reasoning_en_ft_v1
Text Generation · Model Size: 1B · Quant: BF16 · Context Length: 32k · Published: May 21, 2025 · License: gemma · Architecture: Transformer

CALISTA-INDUSTRY/gemma_3_1B_reasoning_multimodal_en_ft_v2 is a 1-billion-parameter fine-tuned Gemma 3 model developed by Mohammad Yani & Rizky Sulaeman from Politeknik Negeri Indramayu. The model is tuned for multimodal reasoning, integrating visual and textual inputs to perform complex tasks. It is intended for applications that require interpreting combined modalities, such as visual question answering and multimodal dialogue systems. The model has a context length of 32768 tokens and is licensed under Apache 2.0.


Model Overview

CALISTA-INDUSTRY/gemma_3_1B_reasoning_multimodal_en_ft_v2 is a 1-billion-parameter multimodal large language model fine-tuned from the Gemma 3 architecture by Mohammad Yani & Rizky Sulaeman at Politeknik Negeri Indramayu. It is designed to process and reason over both visual and textual inputs, making it suitable for complex tasks that require understanding combined modalities. The model operates primarily in English and is released under the Apache 2.0 license.
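The snippet below is a minimal text-only loading and inference sketch, assuming the checkpoint loads with the standard transformers Auto classes and ships a Gemma-style chat template. The repository id is the one given in the description above; the prompt and generation settings are illustrative, not recommendations from the authors.

```python
# Minimal text-only inference sketch (assumption: standard transformers Auto
# classes and a Gemma-style chat template; settings below are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CALISTA-INDUSTRY/gemma_3_1B_reasoning_multimodal_en_ft_v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 weights listed in the header
    device_map="auto",
)

# Build the chat-formatted prompt and generate a reasoning-style answer.
messages = [{"role": "user", "content": "Explain, step by step, why the sky appears blue."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```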

Key Capabilities

  • Multimodal Reasoning: Integrates visual and textual information to perform advanced reasoning tasks.
  • Visual Question Answering (VQA): Can answer questions about a provided image (see the sketch after this list).
  • Image Captioning: Generates descriptive captions for images.
  • Multimodal Dialogue Systems: Supports conversational interactions involving both text and images.
  • Instruction Following: Capable of executing instructions that include visual components.
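
The following is a hypothetical visual question answering sketch. The upstream Gemma 3 1B release is text-only, so whether this fine-tune actually accepts image inputs depends on what the published checkpoint ships; the image-text-to-text classes, the placeholder image URL, and the question below are assumptions for illustration, not a confirmed part of this model's API.

```python
# Hypothetical VQA sketch; assumes the checkpoint exposes an image-text-to-text
# interface (i.e. ships a processor with image support). Image URL is a placeholder.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "CALISTA-INDUSTRY/gemma_3_1B_reasoning_multimodal_en_ft_v2"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat message mixing an image reference with a textual question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/street_scene.jpg"},
            {"type": "text", "text": "How many people are crossing the street?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

If the repository turns out to ship only a text tokenizer, fall back to the text-only loading pattern shown under Model Overview.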

Intended Uses

This model is well-suited for applications such as:

  • Developing intelligent agents that can interpret visual scenes and respond textually.
  • Creating systems for automated image analysis and description.
  • Building interactive experiences where users can query images with natural language.

Limitations

  • Performance may be suboptimal with non-English inputs.
  • Generalization to domains significantly different from its training data may be limited.
  • Not optimized for real-time, low-latency serving without further fine-tuning or dedicated inference infrastructure.