cjvt/GaMS-2B-Instruct

Parameters: 2.6B | Precision: BF16 | Context length: 8192 tokens | Released: Mar 28, 2025 | License: gemma
Overview

GaMS-2B-Instruct: A Multilingual Instruction-Tuned Model

cjvt/GaMS-2B-Instruct is a 2-billion-parameter instruction-tuned model from the GaMS (Generative Model for Slovene) family, developed by researchers at the University of Ljubljana. It is built on Google's Gemma 2 architecture and has undergone extensive continual pretraining to strengthen its capabilities across multiple languages.

Key Capabilities

  • Multilingual Proficiency: Primarily supports Slovene and English, with additional proficiency in Croatian, Bosnian, and Serbian due to dedicated pretraining. It may also retain some capability in other languages supported by the base Gemma 2 model.
  • Instruction Following: This is the supervised fine-tuned (SFT) version, optimized for understanding and executing instructions (see the usage sketch after this list).
  • Specialized Training Data: Continually pretrained in two stages: first on parallel English-Slovene (and some Croatian) corpora to align the languages, then on separate large corpora for Slovene, English, Croatian, Bosnian, and Serbian, totaling over 13 billion tokens.
  • SFT Data Diversity: Fine-tuned on a diverse dataset including GaMS-Instruct datasets (general, digital humanities, medical), filtered parallel English-Slovene translations, selected examples from the Aya Dataset, and manually corrected math competition problems.
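
As a minimal usage sketch, assuming the `transformers` and `torch` packages are installed; the prompt and generation settings below are illustrative and not taken from the model card:

```python
# Minimal usage sketch for cjvt/GaMS-2B-Instruct (illustrative settings).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="cjvt/GaMS-2B-Instruct",
    torch_dtype=torch.bfloat16,  # matches the model's BF16 weights
    device_map="auto",           # use a GPU if one is available
)

# Gemma 2-style chat input: a list of role/content messages.
messages = [{"role": "user", "content": "Kateri je najvišji vrh v Sloveniji?"}]

# The pipeline applies the model's chat template before generating.
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the model's reply
```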

Evaluation Highlights

While GaMS-2B-Instruct shows competitive performance on some tasks, its larger counterparts (GaMS-9B and GaMS-27B) generally achieve higher scores on the Slovene SuperGLUE benchmark. On English-to-Slovene translation, GaMS-2B-Instruct achieved a BERTScore of 0.8554 and a corpus BLEU of 0.2715; on Slovene-to-English translation, it achieved a BERTScore of 0.9392 and a corpus BLEU of 0.2789.
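
For context, here is a sketch of how corpus BLEU and BERTScore are commonly computed, assuming the `sacrebleu` and `bert-score` packages; the sentence pairs are placeholders, not the benchmark data behind the figures above:

```python
# Hedged sketch: computing corpus BLEU and BERTScore for translation output.
import sacrebleu
from bert_score import score

hypotheses = ["Triglav je najvišji vrh v Sloveniji."]   # model outputs (placeholder)
references = ["Triglav je najvišja gora v Sloveniji."]  # gold translations (placeholder)

# Corpus-level BLEU over the whole test set.
# Note: sacrebleu reports BLEU on a 0-100 scale; divide by 100 to compare
# with 0-1 figures such as those quoted above.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score / 100:.4f}")

# BERTScore F1, averaged over all sentence pairs.
P, R, F1 = score(hypotheses, references, lang="sl")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```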

Good For

  • Applications requiring instruction-following in Slovene and related South Slavic languages.
  • Research and development in multilingual NLP, especially for low-resource languages.
  • Tasks benefiting from a model continually pretrained on a significant volume of Slovene and regional language data.