Model Overview
v1v1d1/nayana-gemma3-4b-stage1 is a 4.3-billion-parameter Vision-Language Model (VLM) built on the google/gemma-3-4b-it base model. It was fine-tuned with LoRA (Low-Rank Adaptation) in the MS-Swift framework, specifically targeting multimodal understanding.
Key Capabilities
- Multimodal Understanding: Processes both image and text inputs to generate detailed descriptions and responses.
- Multilingual Support: Trained on data in English (en), Kannada (kn), and Hindi (hi), enabling understanding and generation in these languages.
- Efficient Fine-tuning: Leverages LoRA with a rank of 16 and alpha of 32, allowing for efficient adaptation of the base model.
- Optimized for Document Analysis: Fine-tuned on the Nayana Docmatix Stage 1 dataset, suggesting a focus on document-related visual and textual tasks.
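The LoRA configuration above (rank 16, alpha 32) can be made concrete with some quick arithmetic: a rank-r adapter on a linear layer of shape d_out x d_in adds only r * (d_in + d_out) trainable parameters, scaled at inference by alpha / rank. The layer width below is illustrative only, not taken from the model config.

```python
# Sketch: trainable-parameter overhead of one rank-16 LoRA adapter on a
# single linear layer. The 2560 width is an illustrative assumption.
def lora_overhead(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer."""
    return rank * (d_in + d_out)

d = 2560                        # illustrative hidden width
full = d * d                    # parameters in the frozen base weight
extra = lora_overhead(d, d, rank=16)
scaling = 32 / 16               # alpha / rank, applied to the adapter output

print(extra, full, f"{extra / full:.2%}", scaling)  # 81920 6553600 1.25% 2.0
```

The ratio shows why LoRA is called efficient: per adapted layer, the trainable parameters are on the order of 1% of the frozen weight.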
Training Details
The model was trained for one epoch on a 150k-sample dataset, with a learning rate of 0.0001 and a per-device batch size of 2 (with 4 gradient accumulation steps). The LoRA configuration targeted all linear modules, while the ViT and Aligner components were frozen during training.
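The hyperparameters above imply the following step counts; this is plain arithmetic from the stated figures, not from the training logs.

```python
# Derive the effective batch size and optimizer steps per epoch from the
# stated training configuration.
samples = 150_000          # dataset size
per_device_batch = 2       # batch size per step
grad_accum = 4             # gradient accumulation steps

effective_batch = per_device_batch * grad_accum   # samples per optimizer step
steps_per_epoch = samples // effective_batch

print(effective_batch, steps_per_epoch)  # 8 18750
```

So each optimizer step sees 8 samples, and one epoch corresponds to 18,750 optimizer steps.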
Usage Considerations
This model is suitable for applications requiring image description, visual question answering, or document understanding, particularly in English, Kannada, and Hindi contexts. It can be deployed efficiently using the vLLM or Transformers libraries.
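For Transformers-based inference, Gemma 3 checkpoints consume the multimodal chat-message format sketched below. The image path and prompt text are placeholders, and the processor/model calls are only indicated in comments rather than executed.

```python
# Sketch of the multimodal chat payload used by the Transformers chat
# template for Gemma 3 style models. "page.png" and the prompt are
# placeholders, not values from this model card.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page.png"},               # document image
            {"type": "text", "text": "Describe this document."},
        ],
    }
]

# With transformers installed, this payload would typically be passed to
# AutoProcessor.apply_chat_template(...) and the result to model.generate().
print(messages[0]["role"], len(messages[0]["content"]))  # user 2
```

The same message structure applies for Kannada or Hindi prompts; only the text entry changes.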