GaMS-Beta/GaMS3-12B-Multimodal

VISIONConcurrency Cost:1Model Size:12BQuant:FP8Ctx Length:32kPublished:Mar 31, 2026License:mitArchitecture:Transformer Open Weights Cold

GaMS-Beta/GaMS3-12B-Multimodal is a 12 billion parameter Vision Language Model (VLM) based on the Google Gemma-3-12B-IT architecture. Developed by GaMS-Beta, this model is specifically fine-tuned for Slovenian, excelling in vision question answering tasks. It processes both text and image inputs, making it suitable for multimodal applications requiring understanding of visual content in Slovenian.

Loading preview...

GaMS3-12B-Multimodal: Slovenian Vision Language Model

GaMS3-12B-Multimodal is a 12 billion parameter Vision Language Model (VLM) developed by GaMS-Beta. It is built upon the google/gemma-3-12b-it base model, enhancing its capabilities for multimodal understanding.

Key Capabilities

  • Multimodal Input: Processes both image and text inputs simultaneously.
  • Vision Question Answering: Specialized in answering questions based on visual content.
  • Slovenian Language Support: Fine-tuned specifically for the Slovenian language, enabling native understanding and generation.
  • Custom SFT Training: Developed using a custom Supervised Fine-Tuning (SFT) trainer on a curated instruction-tuning text-image Slovenian dataset.

Use Cases

  • Image Captioning in Slovenian: Generating descriptive text for images.
  • Visual Search: Answering queries about image content in Slovenian.
  • Educational Tools: Creating interactive learning materials that combine visuals and Slovenian text.
  • Content Moderation: Identifying and understanding visual content with Slovenian context.