orai-nlp/Gemma-Kimu-2b-base
Gemma-Kimu-2b-base is a 2.6-billion-parameter large language model developed by orai-nlp through continual pre-training of Google's Gemma-2-2b. It is adapted to Basque: training on Basque monolingual data with English replay improves its syntactic, lexical, and morphological competence in Basque while preserving English performance. It serves as the base for subsequent instruction-tuned versions and shows improved Basque understanding and generation fluency relative to Gemma-2-2b.
Gemma-Kimu-2b-base: Basque Language Adaptation
Gemma-Kimu-2b-base is a 2.6 billion parameter continually pre-trained large language model (LLM) developed by orai-nlp. It is built upon Google's Gemma-2-2b foundational model and focuses exclusively on language adaptation for Basque, without instruction-following alignment. This model serves as a robust base for future instruction-tuned versions, such as Gemma-Kimu-2b-it.
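Because the model has no instruction-following alignment, it is usually better driven with completion-style prompts than with chat-style requests. The sketch below shows one way to do that with few-shot examples. It assumes the checkpoint is published on the Hugging Face Hub under the name given in this card and loads with the standard `transformers` causal-LM classes; the `Galdera:`/`Erantzuna:` ("Question:"/"Answer:") template and the helper names `build_few_shot_prompt` and `complete` are illustrative choices, not part of the model's documentation.

```python
MODEL_ID = "orai-nlp/Gemma-Kimu-2b-base"  # repository name as given in this card


def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs plus a final question as a completion prompt.

    The template is a hypothetical convention; a base model simply continues the text.
    """
    blocks = [f"Galdera: {q}\nErantzuna: {a}" for q, a in examples]
    blocks.append(f"Galdera: {query}\nErantzuna:")
    return "\n\n".join(blocks)


def complete(prompt, max_new_tokens=32):
    """Generate a continuation of `prompt` (downloads the model on first use)."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        [("Zein da Frantziako hiriburua?", "Paris")],
        "Zein da Italiako hiriburua?",
    )
    print(complete(prompt))
```

The few-shot examples give the base model a pattern to continue; without them it may simply extend the question rather than answer it.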
Key Capabilities and Training
- Basque Language Enhancement: The model undergoes continual pre-training using a combination of Basque monolingual data and English replay. This process significantly improves its syntactic, lexical, and morphological competence in Basque.
- Performance Preservation: The English replay strategy helps the model retain its English proficiency, which supports cross-lingual transfer.
- Improved Basque Fluency: Evaluations indicate that Gemma-Kimu-2b-base shows substantial improvements in Basque language understanding, coherence, and text generation fluency compared to the original Gemma-2-2b.
- Training Data: Continual pre-training uses the ZelaiHandi dataset (521 million Basque words) and a subset of the FineWeb dataset (300 million English tokens).
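To make the idea of English replay concrete, the following minimal sketch interleaves a Basque document stream with occasional English documents. It is not the authors' training pipeline: the card does not state how the two sources were mixed, so the sampling scheme, the `replay_prob` parameter, and the function name `mix_with_replay` are all assumptions made for illustration.

```python
import itertools
import random


def mix_with_replay(basque_docs, english_docs, replay_prob, seed=0):
    """Yield (lang, doc) pairs for a continual pre-training stream.

    Each step emits an English "replay" document with probability `replay_prob`,
    otherwise the next Basque document. English documents are cycled; the stream
    ends when the Basque documents run out. `replay_prob` is a hypothetical knob.
    """
    if not 0.0 <= replay_prob < 1.0:
        raise ValueError("replay_prob must be in [0, 1)")
    english_docs = list(english_docs)
    if replay_prob > 0 and not english_docs:
        raise ValueError("english_docs must be non-empty when replay_prob > 0")

    rng = random.Random(seed)  # seeded for reproducible mixing
    basque_iter = iter(basque_docs)
    english_cycle = itertools.cycle(english_docs)

    while True:
        if rng.random() < replay_prob:
            yield "en", next(english_cycle)
        else:
            try:
                doc = next(basque_iter)
            except StopIteration:
                return  # every Basque document has been emitted
            yield "eu", doc


if __name__ == "__main__":
    basque = [f"eu-doc-{i}" for i in range(8)]
    english = ["en-doc-0", "en-doc-1"]
    for lang, doc in mix_with_replay(basque, english, replay_prob=0.25):
        print(lang, doc)
```

Every Basque document appears exactly once and in order, while the share of English text is controlled by a single parameter, which makes it easy to experiment with how much replay is needed to limit forgetting.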
Good For
- Developing Basque-centric LLMs: Ideal as a foundational model for creating instruction-tuned or task-specific models tailored for the Basque language.
- Research in Low-Resource Language Adaptation: Useful for researchers exploring methods of adapting large language models to languages with fewer digital resources.
- Applications Requiring Strong Basque Linguistic Understanding: Suitable for tasks demanding high accuracy in Basque text generation, comprehension, and morphological analysis.