GaMS-27B-Instruct: Multilingual Instruction-Tuned Model
GaMS-27B-Instruct is a 27 billion parameter instruction-tuned model developed by researchers at the University of Ljubljana, Faculty for Computer and Information Science. It is part of the GaMS (Generative Model for Slovene) family, built upon Google's Gemma 2 architecture, and features a substantial context length of 32768 tokens.
Key Capabilities & Training
This model has undergone extensive continual pre-training (CPT) and supervised fine-tuning (SFT) to enhance its multilingual abilities. Its CPT involved two stages:
- Parallel Alignment: Initial training on English-Slovene (and some Croatian) parallel corpora, including datasets like KAS Abstracts, DGT, MaCoCu Parallel, and CC-News, totaling 1.91 billion tokens.
- Second Stage: Further training on separate English, Slovene, Croatian, Bosnian, and Serbian corpora, accumulating 13.62 billion tokens from sources like KAS, MetaFida, and Wikipedia.
The SFT phase utilized approximately 25,000 training examples from diverse datasets such as GaMS-Instruct-GEN, GaMS-Instruct-DH, GaMS-Instruct-MED, and filtered parallel corpora, including a unique collection of 150 solved Slovene national math problems.
Performance & Use Cases
GaMS-27B-Instruct demonstrates strong performance in Slovene language tasks and translation. On the SloBench English to Slovene translation benchmark, it ranks 6th, outperforming several other models. For Slovene to English translation, it secures the 4th position. While its performance on the Slovene SuperGLUE classification tasks is competitive, it particularly shines in translation and generation for its target languages.
Good for:
- Multilingual Text Generation: Especially for Slovene, English, Croatian, Bosnian, and Serbian content creation.
- Translation Tasks: Excels in English-Slovene and Slovene-English translation.
- Research & Development: A robust base for further fine-tuning on specific tasks within its supported languages.
- Conversational AI: Powering chatbots and virtual assistants in the specified languages.