orai-nlp/Gemma-Kimu-2b-it
Gemma-Kimu-2b-it is a 2.6 billion parameter instruction-tuned large language model developed by orai-nlp, based on Google's Gemma-2-2b architecture. It is specifically tailored for the Basque language, continually pre-trained on Basque and English data to enhance linguistic capacity while retaining English proficiency. This model excels in instruction following, safety, and linguistic correctness in Basque, making it suitable for applications requiring strong performance in this low-resource language.
Overview
orai-nlp/Gemma-Kimu-2b-it is a 2.6 billion parameter instruction-tuned large language model (LLM) designed for the Basque language. Built upon Google's Gemma-2-2b base and instruction-tuned models, it decouples language adaptation from post-training alignment. The model first undergoes continual pre-training on Basque monolingual data, anchored by English replay, to improve its linguistic capacity. Instruction-following capabilities are then injected via delta-based weight merging from the instruction-tuned counterpart of the base LLM, transferring both instruction-following behavior and human preference alignment without a separate alignment phase.
Key Capabilities
- Basque Language Proficiency: Significantly enhanced linguistic capacity and instruction following in Basque.
- Instruction Following: Demonstrates markedly better instruction adherence in Basque than the base Gemma-2-2b-it model.
- Safety and Linguistic Correctness: Exhibits improved safety and linguistic accuracy in Basque outputs.
- Efficient Adaptation: Utilizes a method of continual pre-training and delta-based weight merging for efficient adaptation to low-resource languages.
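The delta-based weight merging mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual merging code: it treats each named parameter as a scalar and assumes the simple recipe W_merged = W_adapted + (W_instruct − W_base), i.e. the instruction-tuning delta from the original Gemma-2 pair is added onto the Basque-adapted base; the real merge may scale or filter deltas per tensor.

```python
def delta_merge(adapted: dict, base: dict, instruct: dict) -> dict:
    """Apply the instruction-tuning delta (instruct - base) to the
    language-adapted weights. Parameters are represented here as plain
    floats keyed by name; in practice each value would be a tensor.
    """
    return {name: adapted[name] + (instruct[name] - base[name]) for name in adapted}


# Toy example with a single "parameter":
merged = delta_merge(
    adapted={"w": 1.0},   # base model after continual pre-training on Basque
    base={"w": 0.5},      # original Gemma-2-2b base
    instruct={"w": 0.9},  # original Gemma-2-2b-it
)
```

The appeal of this recipe is that alignment is transferred for free: no Basque preference data or RLHF run is needed, only the publicly released base/instruct weight pair.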
Training Details
The model was continually pre-trained using a combination of Basque and English datasets:
- ZelaiHandi: A large collection of approximately 521 million words (1.5 billion tokens) of high-quality, freely licensed Basque texts.
- FineWeb: A subset of around 300 million tokens sampled from the 15-trillion-token FineWeb English web corpus, used as a replay anchor to maintain English capabilities during continual pre-training.
Evaluation
The model was evaluated on the NoRobotsEU benchmark, a manually translated subset of the No Robots test set comprising 100 Basque instructions across 9 categories. Gemma-Kimu-2b-it scored 48 on "Instruct follow. EU" versus 7 for Gemma-2-2b-it, a substantial improvement in Basque instruction following.
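Assuming the reported score is the number of instructions judged as correctly followed out of the 100 benchmark items (an interpretation not confirmed by this card), per-category results can be tallied with a small helper. The category names below are placeholders, not the benchmark's actual 9 categories:

```python
def score_by_category(results):
    """Aggregate (category, followed) judgments into an overall score
    (count of followed instructions) and per-category (passed, total)
    pairs."""
    totals, passed = {}, {}
    for cat, ok in results:
        totals[cat] = totals.get(cat, 0) + 1
        passed[cat] = passed.get(cat, 0) + int(ok)
    overall = sum(passed.values())
    return overall, {c: (passed[c], totals[c]) for c in totals}


# Toy judgments (category names are illustrative):
overall, by_cat = score_by_category(
    [("open_qa", True), ("open_qa", False), ("summarize", True)]
)
```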
Good For
- Applications requiring robust instruction-following and generation in the Basque language.
- Developers looking for an LLM optimized for low-resource language tasks, specifically Basque.
- Research into efficient cross-lingual transfer and adaptation of LLMs.