Overview
orai-nlp/Gemma-Kimu-2b-it is a 2.6-billion-parameter instruction-tuned large language model (LLM) for the Basque language. Built upon Google's Gemma-2-2b base and instruction-tuned models, it decouples language adaptation from post-training alignment: the base model first undergoes continual pre-training on Basque monolingual data, anchored by English replay, to improve its linguistic capacity; instruction-following capabilities are then injected via delta-based weight merging from the instruction-tuned counterpart of the base LLM, transferring both instruction following and human-preference alignment.
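The delta-based merging step amounts to simple parameter-wise arithmetic. A minimal sketch, assuming the merge takes the common task-vector form W_cpt + (W_it - W_base); the exact recipe is not stated in this card, and plain floats stand in for real weight tensors:

```python
def delta_merge(base, instruct, cpt):
    """Parameter-wise merge: W_merged = W_cpt + (W_it - W_base).

    Each argument maps parameter names to weights. Real checkpoints hold
    tensors; scalars are used here only to illustrate the arithmetic.
    """
    return {name: cpt[name] + (instruct[name] - base[name]) for name in base}


# Toy example: the instruction-tuning delta (+0.5) is transplanted onto
# the Basque-adapted (continually pre-trained) weights.
base = {"w": 1.0}       # original base model
instruct = {"w": 1.5}   # base + instruction tuning
cpt = {"w": 2.0}        # base + Basque continual pre-training
merged = delta_merge(base, instruct, cpt)
print(merged)  # {'w': 2.5}
```

In this framing, the instruction-tuning "delta" is treated as a direction in weight space that can be added to any model sharing the same architecture and initialization, which is what lets alignment transfer without Basque instruction data.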
Key Capabilities
- Basque Language Proficiency: Significantly enhanced linguistic capacity and instruction following in Basque.
- Instruction Following: Demonstrates notable improvements in instruction adherence over the original Gemma-2-2b-it model when prompted in Basque.
- Safety and Linguistic Correctness: Exhibits improved safety and linguistic accuracy in Basque outputs.
- Efficient Adaptation: Combines continual pre-training with delta-based weight merging for efficient adaptation to low-resource languages.
Training Details
The model was continually pre-trained using a combination of Basque and English datasets:
- ZelaiHandi: A large collection of approximately 521 million words (1.5 billion tokens) of high-quality, freely licensed Basque texts.
- FineWeb: A roughly 300-million-token subset of the 15-trillion-token FineWeb English web corpus, used as replay data to anchor and maintain English capabilities.
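Given the stated token counts, the English replay makes up roughly one sixth of the continual pre-training mixture. A quick back-of-the-envelope check (assuming the two corpora are simply concatenated, which the card does not state explicitly):

```python
# Token counts as stated in the training details above.
basque_tokens = 1_500_000_000    # ZelaiHandi (Basque)
english_tokens = 300_000_000     # FineWeb subset (English replay)

total = basque_tokens + english_tokens
replay_share = english_tokens / total
print(f"English replay share: {replay_share:.1%}")  # -> 16.7%
```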
Evaluation
The model was evaluated on NoRobotsEU, a manually translated subset of the NoRobots test set comprising 100 Basque instructions across 9 categories. Gemma-Kimu-2b-it scored 48 on Basque instruction following ("Instruct follow. EU"), compared to 7 for Gemma-2-2b-it, a substantial improvement.
Good For
- Applications requiring robust instruction-following and generation in the Basque language.
- Developers looking for an LLM optimized for low-resource language tasks, specifically Basque.
- Research into efficient cross-lingual transfer and adaptation of LLMs.