GaMS-27B: Multilingual Gemma 2 for Balkan Languages
GaMS-27B is a 27-billion-parameter model from the GaMS family, developed by researchers at the University of Ljubljana, Faculty of Computer and Information Science. It is built on Google's Gemma 2 27B base model and has undergone extensive continual pretraining.
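For orientation, below is a minimal loading sketch using the Hugging Face transformers library. The repository id cjvt/GaMS-27B is an assumption (the GaMS family is released under the cjvt organization on the Hub), as are the dtype and device settings.

```python
# Minimal loading sketch; the repository id is an assumption, not confirmed
# by this document.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cjvt/GaMS-27B"  # assumed Hugging Face Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 precision used in training
    device_map="auto",           # shard the 27B parameters across available GPUs
)
```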
Key Capabilities & Differentiators
- Multilingual Proficiency: Primarily focused on Slovene and English, with significant additional pretraining on Croatian, Bosnian, and Serbian corpora, making it effective across all five languages.
- Continual Pretraining: The model was continually pretrained in two stages: first on parallel English-Slovene (and some Croatian) corpora for cross-lingual alignment, then on separate English, Slovene, Croatian, Bosnian, and Serbian datasets.
- Performance: On the Slovene SuperGLUE benchmark, GaMS-27B achieved an average score of 0.7601, outperforming the base Gemma 2 27B model and other Slovene-specific models like PrešernGPT 0.1 and SlovenianGPT-Chat.
- Context Length: Supports a context length of 32,768 tokens (a short generation sketch follows this list).
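To make the capabilities above concrete, here is a hedged generation sketch that reuses the model and tokenizer loaded earlier. Since GaMS-27B is a base (not instruction-tuned) model, the Slovene prompt is phrased as a continuation; the prompt text and sampling parameters are illustrative choices only.

```python
# Continuation-style prompt in Slovene ("Ljubljana is the capital of Slovenia.").
prompt = "Ljubljana je glavno mesto Slovenije."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # illustrative; the 32,768-token context allows far more
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```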
Ideal Use Cases
- Content Generation: Excellent for generating text in Slovene, English, Croatian, Bosnian, and Serbian.
- Multilingual Applications: Suitable for applications requiring understanding and generation across these specific languages.
- Research & Development: A strong foundation for further fine-tuning or natural language processing research for the Balkan region; a parameter-efficient fine-tuning sketch follows this list.
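As one concrete starting point for the fine-tuning use case above, the sketch below attaches LoRA adapters with the peft library. The target modules and ranks are generic illustrative values for a Gemma 2 architecture, not an official GaMS recipe.

```python
# Parameter-efficient fine-tuning sketch; all hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Gemma 2 blocks
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)  # `model` as loaded above
peft_model.print_trainable_parameters()          # only the small adapters train
```

Training the adapters then proceeds with any standard causal-LM training loop, such as the transformers Trainer.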
Training Details
The model was trained on the Booster partition of the Leonardo HPC system using the NVIDIA NeMo 2.0 framework, with BF16-Mixed precision across 32 nodes of 4 NVIDIA A100 64GB GPUs each (128 GPUs in total).
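The actual runs used NeMo 2.0; purely as an illustration of the same scale and precision settings, an equivalent trainer declaration in plain PyTorch Lightning would look roughly like this. Note that a 27B model does not fit on single 64GB GPUs under plain data parallelism, so NeMo's Megatron-style model parallelism (omitted here) is essential in practice.

```python
# Illustrative scale/precision settings only; not the project's actual NeMo config.
import lightning.pytorch as pl

trainer = pl.Trainer(
    num_nodes=32,            # 32 Leonardo Booster nodes
    devices=4,               # 4 A100 64GB GPUs per node -> 128 GPUs in total
    accelerator="gpu",
    strategy="ddp",          # NeMo adds Megatron-style tensor/pipeline parallelism
    precision="bf16-mixed",  # BF16 mixed precision, as in the original training
)
```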